This paper brings a contribution to the Bayesian theory of nonparametric
and semiparametric estimation. We are interested in the
asymptotic normality of the posterior distribution in Gaussian linear 
regression models when the number of regressors increases with the 
sample size. Two kinds of Bernstein-von Mises theorems are obtained 
in this framework: nonparametric theorems for the parameter itself, 
and semiparametric theorems for functionals of the parameter. We 
apply them to the Gaussian sequence model and to the regression of 
functions in Sobolev and $C^{\alpha}$ classes, in which we get the minimax
convergence rates. Adaptivity is reached for the Bayesian estimators 
of functionals in our applications. 

1. Introduction. To estimate a parameter of interest in a statistical mo- 
del, a Bayesian puts a prior distribution on it and looks at the posterior dis- 
tribution, given the observations. A Bernstein-von Mises theorem is a result 
giving conditions under which the posterior distribution is asymptotically 
normal, centered at the maximum likelihood estimator (MLE) of the model 
used, with a variance equal to the asymptotic frequentist variance of the 
MLE. Other centering can be used; see, for instance, van der Vaart (1998), 
page 144, after the proof of Lemma 10.3. 

Such an asymptotic posterior normality is important because it allows the 
construction of approximate credible regions, based on the posterior distribution,
which retain good frequentist properties. In particular, Markov chain
Monte Carlo (MCMC) algorithms make feasible the construction of
Bayesian confidence regions in complex models, for which frequentist confi- 
dence regions are difficult to build; however, Bernstein-von Mises theorems 
are difficult to derive in complex models. 
Note that the Bernstein-von Mises theorem also has links with information
theory [see Clarke and Barron (1990) and Clarke and Ghosal (2010)].

For parametric models, the Bernstein-von Mises theorem is a well-known 
result, for which we refer to van der Vaart (1998). In nonparametric models 
(where the parameter space is infinite-dimensional or growing) and semi- 
parametric models (when the parameter of interest is a finite-dimensional 
functional of the complete infinite-dimensional parameter), there are still 
relatively few asymptotic normality results. Freedman (1999) gives negative 
results, and we recall some positive ones below. However, many recent papers 
deal with the convergence rate of posterior distributions in various settings, 
which is linked with the model complexity: we refer to Ghosal, Ghosh and 
van der Vaart (2000), Shen and Wasserman (2001) as early representatives 
of this school. 

Nonparametric Bernstein-von Mises theorems have been developed for 
models based on a sieve approximation, where the dimension of the pa- 
rameter grows with the sample size. In particular, two situations have been 
studied: regression models in Ghosal (1999); exponential models in Ghosal 
(2000), Clarke and Ghosal (2010) and Boucheron and Gassiat (2009) (this 
last one deals with the discrete case, when the observations follow some 
unknown infinite multinomial distribution). 

In semiparametric frameworks the asymptotic normality has been ob- 
tained in several situations. Kim and Lee (2004) and Kim (2006) study the 
nonparametric right-censoring model and the proportional hazard model. 
Castillo (2010) obtains Bernstein-von Mises theorems for Gaussian process 
priors, in the semiparametric framework where the unknown quantity is 
$(\theta, f)$, with $\theta$ the parameter of interest and $f$ an infinite-dimensional
nuisance parameter. See also Shen (2002). Rivoirard and Rousseau (2009) obtain
the Bernstein-von Mises theorem for linear functionals of the density
of the observations, in the context of a sieve approximation: sequences of
spaces with an increasing dimension $k_n$ are used to approximate an
infinite-dimensional space. These authors also achieve the frequentist minimax
estimation rate for densities in specific regularity classes with a deterministic
(nonadaptive) value of the dimension $k_n$.

Here we obtain nonparametric and semiparametric Bernstein-von Mises 
theorems in a Gaussian regression framework with an increasing number of 
regressors. We address two challenging problems. First, we try to understand 
better when the Bernstein-von Mises theorem holds and when it does not. In 
the latter case the Bayesian credible sets no longer preserve their frequentist 
asymptotic properties. Second, we look for adaptive Bayesian estimators in 
our semiparametric settings. 

Our nonparametric results cover the case of a specific Gaussian prior, and 
the case of more generic smooth priors. They are said to be nonparametric 
because we use sieve priors, that is, the dimension of the parameter grows. 
These results improve on the preceding ones by Ghosal (1999), which did
not suppose the normality of the errors but imposed other conditions, in
particular, on the growth rate of the number of regressors. We apply our
results to the Gaussian sequence model, as well as to periodic Sobolev classes
and to regularity classes $C^{\alpha}[0,1]$ in the context of the regression model
(using, respectively, trigonometric polynomials and splines as regressors). In all
these situations we get the asymptotic normality of the posterior in addition 
to the minimax convergence rates, with appropriate (nonadaptive) choices of 
the prior. We also show that for some priors known to reach this convergence 
rate, the Bernstein-von Mises theorem does not hold. 

We also derive semiparametric Bernstein-von Mises theorems for linear
and nonlinear functionals of the parameter. The linear case is an immediate 
corollary of the nonparametric theorems and does not need any additional 
conditions. We apply these results to the periodic Sobolev classes to estimate 
a linear functional and the $L^2$ norm of the regression function $f$ when it is
smooth enough, and in both cases we are able to build an adaptive Bayesian 
estimator which achieves the minimax convergence rate in all classes of the 
collection, in addition to the asymptotic normality. 

The paper is organized as follows. We present the framework in Section 2. 
Section 3 states the nonparametric Bernstein-von Mises theorems, for Gaus- 
sian and non-Gaussian priors. In Section 4 we derive the semiparametric 
Bernstein-von Mises theorems for linear and nonlinear functionals of the 
parameter. Then in Section 5 we give applications to the Gaussian sequence 
model, and to the regression of a function in a Sobolev class and in a $C^{\alpha}[0,1]$ class.
In Section 6 the nonparametric and semiparametric Bernstein-von Mises 
theorems are proved. The appendices contain various technical tools used in 
the main analysis; the appendices can be found in the supplemental article 
[Bontemps (2011)]. 

2. Framework. We consider a Gaussian linear regression framework. For
any $n \ge 1$, our observation $Y = (Y_1, \ldots, Y_n) \in \mathbb{R}^n$ is a Gaussian random vector

(1)    $Y = F + \varepsilon$,

where the vector of errors $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n) \sim \mathcal{N}(0, \sigma_n^2 I_n)$, with $I_n$ the $n \times n$
identity matrix, and the mean vector $F$ belongs to $\mathbb{R}^n$. Note that the
dimension of $Y$ is the sample size $n$, and that $\sigma_n^2$ is known but may depend
on $n$. Let $F_0$ be the true mean vector of $Y$, with distribution $\mathcal{N}(F_0, \sigma_n^2 I_n)$.
Probabilities and expectations under $F_0$ are denoted $P_{F_0}$ and $E$.

Let $\phi_1, \ldots, \phi_{k_n}$ be a collection of $k_n$ linearly independent regressors in $\mathbb{R}^n$,
where $k_n \le n$ grows with $n$. We gather these regressors in the $n \times k_n$ matrix $\Phi$
of rank $k_n$, and $\langle\phi\rangle = \{\Phi\theta : \theta = (\theta_1, \ldots, \theta_{k_n}) \in \mathbb{R}^{k_n}\}$ denotes their linear span.
The Bernstein-von Mises theorems will be stated in association with $\langle\phi\rangle$, the
vector space of possible mean vectors in the model, which is possibly misspecified.
We denote by $P_\theta$ the probability distribution of a random variable
following $\mathcal{N}(\Phi\theta, \sigma_n^2 I_n)$ and by $E_\theta$ the associated expectation.

As examples, we present three different settings, each with its own collection
of regressors. In Section 5 the Bernstein-von Mises theorems are applied
to each of these frameworks: 

(1) The Gaussian sequence model. Our first application concerns the
Gaussian sequence model, which is also equivalent to the white noise model
[see, e.g., Massart (2007), Chapter 4]. We consider the infinite-dimensional
setting

(2)    $Y_j = \theta_j^0 + \frac{1}{\sqrt{n}}\,\xi_j$,    $j \ge 1$,

where the random variables $\xi_j$, $j \ge 1$, are independent and have distribution
$\mathcal{N}(0, 1)$. Projecting on the first $k_n$ coordinates with $k_n \le n$, we retrieve
our model (1) with $\theta_0 = (\theta_j^0)_{1 \le j \le k_n}$, $\sigma_n = 1/\sqrt{n}$ and $\Phi^T\Phi = I_{k_n}$.

(2) Regression of a function in a Sobolev class. Let $f : [0,1] \to \mathbb{R}$ be a function
in $L^2([0,1])$. We observe realizations of random variables

(3)    $Y_i = f(i/n) + \varepsilon_i$

for $1 \le i \le n$, where the errors $\varepsilon_i$ are i.i.d. $\mathcal{N}(0, \sigma_n^2)$ and $\sigma_n$ does not depend
on $n$.

We denote by $(\varphi_j)_{j \ge 1}$ the Fourier basis

(4)    $\varphi_1 = 1$,
       $\varphi_{2m}(x) = \sqrt{2}\cos(2\pi m x)$    $\forall m \ge 1$,
       $\varphi_{2m+1}(x) = \sqrt{2}\sin(2\pi m x)$    $\forall m \ge 1$.

In conjunction with the regular design $x_i = i/n$ for $1 \le i \le n$, this gives
the collection of regressors

$\phi_j = (\varphi_j(i/n))_{1 \le i \le n}$,    $1 \le j \le k_n$.

In practice, we suppose that $f$ belongs to one of the periodic Sobolev
classes:



Definition 1. Let $\alpha > 0$ and $L > 0$. Let $(\varphi_j)_{j \ge 1}$ denote the Fourier
basis (4). We define the Sobolev class $W(\alpha, L)$ as the collection of all functions
$f = \sum_{j=1}^{\infty} \theta_j \varphi_j$ in $L^2([0,1])$ such that $\theta = (\theta_j)_{j \ge 1}$ is an element of the
ellipsoid of $\ell^2(\mathbb{N})$,

$\mathcal{E}(\alpha, L) = \Big\{ \theta \in \ell^2(\mathbb{N}) : \sum_{j=1}^{\infty} a_j^2 \theta_j^2 \le \frac{L^2}{\pi^{2\alpha}} \Big\}$,

where

(5)    $a_j = j^{\alpha}$ if $j$ is even;    $a_j = (j-1)^{\alpha}$ if $j$ is odd.



(3) Regression of a function in $C^{\alpha}[0,1]$. Let $\alpha > 0$, and $f \in C^{\alpha}[0,1]$. This
means that $f$ is $\alpha_0$ times continuously differentiable with $\|f\|_{\alpha} < \infty$, $\alpha_0$
being the greatest integer less than $\alpha$ and the seminorm $\|\cdot\|_{\alpha}$ being defined
by

$\|f\|_{\alpha} = \sup_{x \ne x'} \frac{|f^{(\alpha_0)}(x) - f^{(\alpha_0)}(x')|}{|x - x'|^{\alpha - \alpha_0}}$.

Consider a design $(x_i^{(n)})_{n \ge 1,\, 1 \le i \le n}$, not necessarily uniform. Here $F_0$ is the
vector $(f(x_i^{(n)}))_{1 \le i \le n}$. Once again we suppose that $\sigma_n = \sigma$ does not depend
on $n$.

Fix an integer $q \ge \alpha$, and let $K = k_n + 1 - q$. Partition the interval $(0,1]$
into $K$ subintervals $((j-1)/K, j/K]$ for $1 \le j \le K$. We want to perform the
regression of $f$ in the space of splines of order $q$ defined on that partition, and
use the B-splines basis $(B_j)_{1 \le j \le k_n}$ [see, e.g., de Boor (1978)]. Our collection
of regressors is $\phi_j = (B_j(x_i^{(n)}))_{1 \le i \le n}$, for $1 \le j \le k_n$.

For any value of $n \ge 1$, let $W$ be a prior distribution on $\mathbb{R}^{k_n}$ and, for
$F = \Phi\theta$, let $W$ be the prior distribution on $F \in \mathbb{R}^n$ obtained from $W$ on $\theta$.
Its support is included in $\langle\phi\rangle$. Let $P_W$ denote the marginal distribution of $Y$
under prior $W$, and $W(dG(F)|Y)$ denote the posterior distribution of a functional
$G(F)$. Note that everything depends on $n$ ($W$, e.g., is a distribution
on $\mathbb{R}^n$) even if we do not use $n$ as an index to simplify our notation.

Both the parametrization by $\theta$ and the corresponding collection of regressors
$\phi_1, \ldots, \phi_{k_n}$ are arbitrary: what matters is the posterior distribution of $F$,
and this depends on the space $\langle\phi\rangle$, not on the basis used to parametrize it.
The span $\langle\phi\rangle$ is characterized by the matrix $\Sigma = \Phi(\Phi^T\Phi)^{-1}\Phi^T$ of the
orthogonal projection onto $\langle\phi\rangle$.

The prior $W$ is a sieve prior: that is, its support comes from a finite-dimensional
model whose dimension $k_n$ grows with $n$. The collection of
growing models $\langle\phi\rangle$ (the sieve) can be seen as an approximation framework,
each model being possibly misspecified. There is no true parameter in
our setting: the true mean vector $F_0$ may fall outside $\langle\phi\rangle$ and correspond to
none of the possible values of $\theta$. There is then a bias which has to be dealt
with, linked to the choice of the cutoff $k_n$.

When dealing with Bernstein-von Mises results, the question of the asymptotic
centering point arises. In nonparametric models constructed on an
infinite-dimensional parameter, there is no definition of a MLE; what the
natural centering for a Bernstein-von Mises theorem should be in such situations
is not clear. In the model $\langle\phi\rangle$, the orthogonal projection $Y_{\langle\phi\rangle} = \Sigma Y$
of $Y$ is also the MLE of $F$. We denote by $\theta_Y = (\Phi^T\Phi)^{-1}\Phi^T Y$ its associated
parameter. Let also $F_{\langle\phi\rangle} = \Phi\theta_0$ be the projection of $F_0$ on $\langle\phi\rangle$, with
$\theta_0 = (\Phi^T\Phi)^{-1}\Phi^T F_0$. Now, $F_0 - F_{\langle\phi\rangle}$ corresponds to the bias introduced by the
use of the model $\langle\phi\rangle$, and $F_{\langle\phi\rangle}$ is the centering point of the distribution of
the MLE $Y_{\langle\phi\rangle}$ under $P_{F_0}$:

$Y_{\langle\phi\rangle} \sim \mathcal{N}(F_{\langle\phi\rangle}, \sigma_n^2 \Sigma)$.

Although the MLE is naturally defined in the sieve $\langle\phi\rangle$, it heavily depends on
the choice of $\langle\phi\rangle$. Therefore, the Bernstein-von Mises theorems we establish
depend on the choice of the sieve the prior distribution is built on.
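
As a short worked consequence of these definitions, the quadratic risk of the MLE splits into a variance term and the squared modeling bias. This is only a sketch: it writes $Y_{\langle\phi\rangle} = \Sigma Y$ for the MLE and $F_{\langle\phi\rangle} = \Sigma F_0$ for the projection of the true mean, and uses nothing beyond the facts that $\Sigma$ is an orthogonal projection of rank $k_n$ and that $Y \sim \mathcal{N}(F_0, \sigma_n^2 I_n)$.

```latex
\begin{align*}
Y_{\langle\phi\rangle} - F_0
  &= \Sigma (Y - F_0) + (F_{\langle\phi\rangle} - F_0),
  \qquad \Sigma (Y - F_0) \perp (F_{\langle\phi\rangle} - F_0), \\
E\,\|Y_{\langle\phi\rangle} - F_0\|^2
  &= E\,\|\Sigma (Y - F_0)\|^2 + \|F_{\langle\phi\rangle} - F_0\|^2 \\
  &= \sigma_n^2 \operatorname{tr}(\Sigma) + \|F_0 - F_{\langle\phi\rangle}\|^2
   = \sigma_n^2 k_n + \|F_0 - F_{\langle\phi\rangle}\|^2 .
\end{align*}
```

The first term grows with the cutoff $k_n$ while the second can only shrink as the sieve gets richer, which is the trade-off behind the choice of $k_n$.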

3. Nonparametric Bernstein-von Mises theorems. The proofs of our
nonparametric results are deferred to Section 6.

3.1. With Gaussian priors. We consider here a centered, normal prior
distribution $W$ which is isotropic on $\langle\phi\rangle$, so that $W = \mathcal{N}(0, \tau_n^2 \Sigma)$ for some
sequence $\tau_n$. Here $\tau_n$ is a scale parameter, and essentially the only assumption
needed in this case is that $\tau_n$ is large enough as $n$ grows. Let $\|Q - Q'\|_{TV}$
denote the total variation norm between two probability distributions $Q$
and $Q'$.

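Theorem 1. Assume that $\sigma_n = o(\tau_n)$, $\|F_0\| = o(\tau_n^2/\sigma_n)$ and
$k_n = o(\tau_n^4/\sigma_n^4)$. Then

$E\,\|W(dF|Y) - \mathcal{N}(Y_{\langle\phi\rangle}, \sigma_n^2 \Sigma)\|_{TV} \to 0$    as $n \to \infty$.

In terms of $\theta$ instead of $F$, an equivalent statement is

$E\,\|W(d\theta|Y) - \mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1})\|_{TV} \to 0$    as $n \to \infty$.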

Theorem 1 does not deal with the modeling bias introduced by taking
a prior restricted to $\langle\phi\rangle$. This is an important question in nonparametric
statistics, and $k_n$ has to be chosen in order to achieve a satisfactory
bias-variance trade-off.

As an example, let us consider a typical regression framework with
$F_0 = (f_0(x_i))_{1 \le i \le n}$, where $f_0$ is some function and $(x_i)_{1 \le i \le n}$ some design. If $\sigma_n$
does not depend on $n$, both conditions $\|F_0\| = o(\tau_n^2/\sigma_n)$ and $k_n = o(\tau_n^4/\sigma_n^4)$
are satisfied if $f_0$ is bounded and $n^{1/4} = o(\tau_n)$. These conditions can be read
in another way: $\tau_n^4$ must be large enough with respect to $\|F_0\|$ and $k_n$.
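
To see where conditions of this kind can come from, the following sketch evaluates the Kullback-Leibler divergence between the conjugate posterior under the prior $\mathcal{N}(0, \tau_n^2\Sigma)$, namely $\mathcal{N}(cY_{\langle\phi\rangle}, c\sigma_n^2\Sigma)$ with $c = \tau_n^2/(\sigma_n^2 + \tau_n^2)$, and the limit $\mathcal{N}(Y_{\langle\phi\rangle}, \sigma_n^2\Sigma)$, then bounds the total variation distance (taken here as $\sup_A |Q(A) - Q'(A)|$) by Pinsker's inequality. This route is an illustration chosen here and is not claimed to be the argument used in Section 6; the function names and the numerical values are hypothetical.

```python
import math


def kl_posterior_to_limit(k, sigma, tau, y_proj_sq_norm):
    """KL( N(c*Y_phi, c*sigma^2*Sigma) || N(Y_phi, sigma^2*Sigma) ) on a k-dim span.

    c = tau^2 / (sigma^2 + tau^2) is the conjugate shrinkage factor and
    y_proj_sq_norm is ||Y_phi||^2, the squared norm of the projected data.
    """
    c = tau ** 2 / (sigma ** 2 + tau ** 2)
    # Covariance mismatch: trace term minus dimension plus log-determinant ratio.
    cov_term = k * (c - 1.0 - math.log(c))
    # Centering mismatch: the posterior mean is shrunk towards 0 by the factor c.
    mean_term = (1.0 - c) ** 2 * y_proj_sq_norm / sigma ** 2
    return 0.5 * (cov_term + mean_term)


def tv_upper_bound(k, sigma, tau, y_proj_sq_norm):
    """Pinsker's inequality: TV <= sqrt(KL / 2), capped at 1."""
    kl = max(kl_posterior_to_limit(k, sigma, tau, y_proj_sq_norm), 0.0)
    return min(1.0, math.sqrt(kl / 2.0))


if __name__ == "__main__":
    n = 10_000
    k, sigma = n, 1.0
    y_sq = 2.0 * n  # illustrative size of ||Y_phi||^2 for a bounded f_0
    for exponent in (0.25, 0.3, 0.4, 0.5):
        print(exponent, tv_upper_bound(k, sigma, n ** exponent, y_sq))
```

For large $\tau_n$ the covariance term behaves like $k_n\sigma_n^4/(2\tau_n^4)$ and the mean term like $\sigma_n^2\|Y_{\langle\phi\rangle}\|^2/\tau_n^4$, which matches the orders $k_n = o(\tau_n^4/\sigma_n^4)$ and $\|F_0\| = o(\tau_n^2/\sigma_n)$ appearing in Theorem 1.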

3.2. With smooth priors. We consider now more general priors. To un- 
derstand better the conditions we use, we need to look at the mechanics of 
the Bernstein-von Mises theorem. 
Behind a Bernstein-von Mises theorem there is a LAN structure: the
log-likelihood admits a quadratic expansion near the MLE. Since the posterior
density is proportional to the product of the prior density and the likelihood, 
the prior has to be locally constant to let the likelihood alone influence the 
posterior and produce the Gaussian shape. To prove a Bernstein-von Mises 
theorem, we look for a subset which is simultaneously (1) large enough, so 
that the posterior will concentrate on it, and (2) small enough, so that we 
can find approximately constant priors on it. The larger the dimension of 
the model is, the more difficult it is to combine these two requirements, and 
the more difficult it is to obtain a Bernstein-von Mises theorem. 

The geometry of the subsets is naturally suggested by the normal
distribution we are looking for. For $M > 0$, consider the ellipsoid

(6)    $\mathcal{E}_{\theta_0,\Phi}(M) = \{\theta \in \mathbb{R}^{k_n} : (\theta - \theta_0)^T \Phi^T\Phi\, (\theta - \theta_0) \le \sigma_n^2 M\}$.



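Theorem 2. Suppose that $W$ is induced by a distribution $W$ on $\theta$ admitting
a density $w(\theta)$ with respect to the Lebesgue measure. If there exists
a sequence $(M_n)_{n \ge 1}$ such that:

(1) $\sup_{\|\Phi h\|^2 \le \sigma_n^2 M_n,\ \|\Phi g\|^2 \le \sigma_n^2 M_n} \frac{w(\theta_0 + h)}{w(\theta_0 + g)} \to 1$ as $n \to \infty$,

(2) $k_n \ln k_n = o(M_n)$,

(3) $\max\Big(0, \ln\Big(\frac{\sqrt{\det(\Phi^T\Phi)}}{\sigma_n^{k_n}\, w(\theta_0)}\Big)\Big) = o(M_n)$,

then

$E\,\|W(dF|Y) - \mathcal{N}(Y_{\langle\phi\rangle}, \sigma_n^2 \Sigma)\|_{TV} \to 0$    as $n \to \infty$.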

With condition (1) we ask for a sufficiently flat prior $W$ in an
ellipsoid $\mathcal{E}_{\theta_0,\Phi}(M_n)$. Condition (2) ensures, in particular, that the weight the
normal distribution puts on $\mathcal{E}_{\theta_0,\Phi}(M_n)$ in the limit goes to 1. Condition (3)
makes quantities linked to the volume of $\mathcal{E}_{\theta_0,\Phi}(M_n)$ appear and guarantees
that it has enough prior weight. This kind of assumption is common in the
literature dealing with the concentration of posterior distributions; see, for
instance, Ghosal, Ghosh and van der Vaart (2000).
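
The role of condition (2) can be illustrated numerically. If $\theta \sim \mathcal{N}(\theta_0, \sigma_n^2(\Phi^T\Phi)^{-1})$, the quadratic form defining the ellipsoid (6), divided by $\sigma_n^2$, follows a chi-square distribution with $k_n$ degrees of freedom, so the Gaussian mass of the ellipsoid of radius $M$ is $P(\chi^2_{k_n} \le M)$. The sketch below estimates that probability by simulation; centering the Gaussian at $\theta_0$ rather than at $\theta_Y$ is a simplification made here, and the function name and sample sizes are hypothetical.

```python
import math
import random


def limit_mass_on_ellipsoid(k, m, draws=2000, seed=0):
    """Monte Carlo estimate of P(chi2_k <= m).

    This is the mass that N(theta_0, sigma^2 (Phi^T Phi)^{-1}) puts on the
    ellipsoid {theta : (theta-theta_0)^T Phi^T Phi (theta-theta_0) <= sigma^2 m}.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        # A chi-square draw with k degrees of freedom: sum of k squared N(0,1).
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
        if chi2 <= m:
            hits += 1
    return hits / draws


if __name__ == "__main__":
    for k in (5, 20, 80):
        at_k = limit_mass_on_ellipsoid(k, k)
        at_k_log_k = limit_mass_on_ellipsoid(k, k * math.log(k))
        print(k, at_k, at_k_log_k)
```

A radius of order $k_n$ only captures a fixed fraction of the mass, whereas a radius of order $k_n \ln k_n$ captures almost all of it.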

Several of our applications illustrate that priors known to induce the
posterior minimax convergence rate may not be flat enough to get the Gaussian
shape with the asymptotic variance $\sigma_n^2 \Sigma$.

An important remark is the following: condition (2) does not really limit
the growth rate of $k_n$. Reading it in conjunction with the other two conditions, we
see that a flatter prior distribution will permit us to take $M_n$ larger. Thus,
the only condition on the growth rate of $k_n$ is $k_n \le n$.
Note that Theorem 2 is not a generalization of Theorem 1: Theorem 1 is
more powerful for isotropic Gaussian priors. Consider again the regression
framework with $F_0 = (f_0(x_i))_{1 \le i \le n}$, where $f_0$ is a bounded function and
$(x_i)_{1 \le i \le n}$ is some design. Suppose that $\sigma_n$ does not depend on $n$, and take
$k_n = n$ and $W = \mathcal{N}(0, \tau_n^2 \Sigma)$. Then the conditions of Theorem 1 are satisfied
as soon as $n^{1/4} = o(\tau_n)$, but with Theorem 2 we need $n \ln n = o(\tau_n^2)$.

Our main applications, to the Gaussian sequence model and to the re- 
gression model using trigonometric polynomials and splines, are developed 
in Section 5. We now present two remarks about the parametric case and 
the comparison with the pioneering work of Ghosal (1999).

The parametric case. Consider the regression of a function $f$ defined
on $[0,1]$, with a fixed number $k$ of regressors. Set a design $(x_i^{(n)})_{n \ge 1,\, 1 \le i \le n}$,
with $x_i^{(n)} \in [(i-1)/n, i/n]$ for any $n \ge 1$, and $F_0 = (f(x_i^{(n)}))_{1 \le i \le n}$. Choose
a finite number of piecewise continuous and linearly independent regressors
$(\varphi_j)_{1 \le j \le k}$ on $[0,1]$, and set $\phi_j = (\varphi_j(x_i^{(n)}))_{1 \le i \le n}$ for $1 \le j \le k$. Assume
that $f$, $k_n = k$, $\sigma_n = \sigma$ and $W$ do not depend on $n$.

We would like to compare Theorem 2 with the usual Bernstein-von Mises
theorem for parametric models applied to such a regression framework. In
that setting, let us suppose that $w$ is continuous and positive, and that $f$
is bounded. Then condition (1) becomes $M_n = o(n)$, while condition (3)
reduces to $\ln n = o(M_n)$. Clearly, there exist such sequences $(M_n)_{n \ge 1}$, so
Theorem 2 applies. Here the rescaling by $\sqrt{n}$ of the Bernstein-von Mises
theorem for parametric models is hidden in the asymptotic posterior variance
$\sigma^2 (\Phi^T\Phi)^{-1}$ of the parameter $\theta$. Indeed, $(1/n)\,\Phi^T\Phi$ is a Riemann
sum and converges toward the Gramian matrix of the collection $(\varphi_j)_{1 \le j \le k}$
in $L^2([0,1])$.

Proof. We have $\|\Phi\theta_0\| \le \|F_0\| \le \sqrt{n}\,\|f\|_{\infty}$, and $\|\theta_0\|^2 \le \|(\Phi^T\Phi)^{-1}\| \cdot
\|\Phi\theta_0\|^2 \le \|n(\Phi^T\Phi)^{-1}\|\, \|f\|_{\infty}^2$. The matrix $(1/n)\,\Phi^T\Phi$ converges toward the Gramian
matrix of the collection $(\varphi_j)_{1 \le j \le k}$ in $L^2([0,1])$, and its smallest eigenvalue is
lower bounded for $n$ large enough. Therefore, $\theta_0$ is bounded, and we can
consider that it lies in some compact set on which $w$ is uniformly continuous and
lower bounded by a positive constant. The rest follows. □
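
The Riemann-sum convergence of $(1/n)\,\Phi^T\Phi$ used above is easy to check numerically. The sketch below does so for the hypothetical pair of regressors $\varphi_1(x) = 1$ and $\varphi_2(x) = x$ on the design $x_i = i/n$ (which satisfies $x_i \in [(i-1)/n, i/n]$), whose Gramian matrix in $L^2([0,1])$ has entries $1$, $1/2$, $1/2$ and $1/3$. The regressors and the helper name are illustrative choices, not taken from the paper.

```python
def gram_over_n(regressors, n):
    """Return (1/n) Phi^T Phi for the design x_i = i/n, i = 1..n."""
    xs = [i / n for i in range(1, n + 1)]
    # Column j of Phi holds the j-th regressor evaluated on the design.
    cols = [[phi(x) for x in xs] for phi in regressors]
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in cols] for ci in cols]


# Illustrative choice with k = 2: the constant and the identity function.
regressors = [lambda x: 1.0, lambda x: x]
# Gram matrix of (1, x) in L^2([0,1]): integrals of 1, x and x^2 over [0,1].
limit = [[1.0, 0.5], [0.5, 1.0 / 3.0]]

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, gram_over_n(regressors, n))
```

Since the limit is invertible, $\Phi^T\Phi$ is of order $n$, which is how the usual $\sqrt{n}$ rescaling ends up inside the posterior variance $\sigma^2(\Phi^T\Phi)^{-1}$.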

Comparison with Ghosal's conditions. The Bernstein-von Mises theorem
in a regression setting when the number of parameters goes to infinity was
first studied by Ghosal (1999) as an early step in the development of
frequentist nonparametric Bayesian theory. In his paper the errors $\varepsilon_i$ are
not supposed to be Gaussian. Under the Gaussianity assumption we get
improved results, which means that we have a nontrivial generalization of
the Ghosal (1999) conditions in the case of Gaussian errors. In particular,
our condition for the prior smoothness is simpler, and the growth rate of
the dimension $k_n$ is much less constrained:

• Ghosal (1999) does not admit a modeling bias between $F_0$ and $\Phi\theta_0$. In
the present work the normality of the errors permits us to take $F_0 \ne \Phi\theta_0$
without any cost, as it appears in the core of the proof (Lemma 7). The
possibility of considering misspecified models is an important improvement.

• In Ghosal (1999) $\sigma_n$ is constant, which does not allow the application to
the Gaussian sequence model.

• Ghosal (1999) restricts the growth of the dimension $k_n$ to $k_n^4 \ln k_n = o(n)$
(see below). In our setting we only require $k_n \le n$. With Ghosal's condition
we could not have obtained the applications to the Gaussian sequence
model or to the regression model for Sobolev or $C^{\alpha}$ classes.

Let $\delta_n^2 = \|(\Phi^T\Phi)^{-1}\|$ be the operator norm of $(\Phi^T\Phi)^{-1}$ for the $\ell^2$ metric,
and let $\eta_n^2$ be the maximal value on the diagonal of $\Sigma$. With our notation,
the last two assumptions of Ghosal (1999) become:

(A3) There exists $\eta_0 > 0$ such that $w(\theta_0) > \eta_0^{k_n}$. Moreover,

(7)    $|\ln w(\theta) - \ln w(\theta_0)| \le L_n(C)\,\|\theta - \theta_0\|$,

whenever $\|\theta - \theta_0\| \le C \delta_n k_n \sqrt{\ln k_n}$, where the Lipschitz constant $L_n(C)$ is
subject to some growth restriction [see assumption (A4)].

(A4)

(8)    $\forall C > 0$,  $L_n(C)\,\delta_n k_n \sqrt{\ln k_n} \to 0$    and    $\eta_n k_n^{3/2} \sqrt{\ln k_n} \to 0$.

Further, the design satisfies a condition on the trace of $\Phi^T\Phi$:

(9)    $\operatorname{tr}(\Phi^T\Phi) = O(n k_n)$.

Since $\Sigma$ is an orthogonal projection matrix on a $k_n$-dimensional space,
$\operatorname{tr}(\Sigma) = k_n$ and $\eta_n^2 \ge k_n/n$. Thus, the last part of (8) implies $k_n^4 \ln k_n = o(n)$.

If we add the normality of the errors and a slight technical condition
$\ln n = o(k_n \ln k_n)$, these assumptions imply ours. Indeed, set $M_n = C^2 k_n^2 \ln k_n$
for some arbitrary value of $C$. Our condition (2) is immediate. Condition (1)
is obtained from (7) and the first part of (8). The beginning of (A3) implies
$-\ln w(\theta_0) = O(k_n) = o(M_n)$. Using the concavity of the $\ln$ function and (9),
we get $\ln\det(\Phi^T\Phi) \le k_n \ln \operatorname{tr}(\Phi^T\Phi) - k_n \ln k_n = O(k_n \ln n) = o(M_n)$. Therefore,
our condition (3) holds.

4. Semiparametric Bernstein-von Mises theorems. We consider two kinds
of functionals of $F$: linear and nonlinear ones. These results can be easily
adapted to functionals of $\theta$, using the maps $\theta \mapsto \Phi\theta$ and $F \mapsto (\Phi^T\Phi)^{-1}\Phi^T F$.

4.1. The linear case. For linear functionals of $F$, we have the following
corollary:

Corollary 1. Let $p \ge 1$ be fixed, and $G$ be a $p \times n$ matrix. Suppose
that the conditions of either Theorem 1 or Theorem 2 are satisfied. Then

$E\,\|W(d(GF)|Y) - \mathcal{N}(GY_{\langle\phi\rangle}, \sigma_n^2 G\Sigma G^T)\|_{TV} \to 0$    as $n \to \infty$.

Further, the distribution of $GY_{\langle\phi\rangle}$ is $\mathcal{N}(GF_{\langle\phi\rangle}, \sigma_n^2 G\Sigma G^T)$.

Corollary 1 is just a linear transform of the preceding theorems, and of
the distribution of $Y_{\langle\phi\rangle}$.

An example of application is given in Section 5.2, in the context of the 
regression on Fourier's basis. 

4.2. The nonlinear case. The Bernstein-von Mises theorem presented
here for nonlinear functionals is derived from the nonparametric theorems
by means of Taylor expansions. In the Taylor expansion of a functional,
the first-order term naturally leads to the posterior normality, as in the
case of linear functionals. We do not want the second-order term to
interfere with this phenomenon: it has to be controlled. The conditions of
Theorem 3 below are stated to permit this control of the second-order
term.

Let $p \ge 1$ be fixed, and $G : \mathbb{R}^n \to \mathbb{R}^p$ be a twice continuously differentiable
function. For $F \in \mathbb{R}^n$, let $\dot{G}_F$ denote the Jacobian matrix of $G$ at $F$, and
$D^2_F G(\cdot,\cdot)$ the second derivative of $G$, as a bilinear function on $\mathbb{R}^n$. For any
$F \in \langle\phi\rangle$ and $a > 0$, let

(10)    $B_F(a) = \sup_{h \in \langle\phi\rangle :\, \|h\|^2 \le \sigma_n^2 a}\ \sup_{0 \le t \le 1} \|D^2_{F+th} G(h,h)\|$,

where $\|\cdot\|$ denotes the Euclidean norm of $\mathbb{R}^p$.

We also consider the following nonnegative symmetric matrix

(11)    $\Gamma_F = \sigma_n^2\, \dot{G}_F \Sigma\, \dot{G}_F^T$.

In the following, $\|\Gamma_F^{-1}\|$ denotes the Euclidean operator norm of $\Gamma_F^{-1}$, which
is also the inverse of the smallest eigenvalue of $\Gamma_F$.

Let $\mathcal{I}$ be the collection of all intervals in $\mathbb{R}$, and for any $I \in \mathcal{I}$, let
$\psi(I) = P(Z \in I)$, where $Z$ is a $\mathcal{N}(0,1)$ random variable. Recall that $Y_{\langle\phi\rangle}$ is the MLE
and the orthogonal projection of $Y$ on $\langle\phi\rangle$.

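Theorem 3. Let $G : \mathbb{R}^n \to \mathbb{R}^p$ be a twice continuously differentiable function,
and let $\Gamma_F$ be as just defined. Suppose that $\Gamma_{F_{\langle\phi\rangle}}$ is nonsingular, and
that there exists a sequence $(M_n)_{n \ge 1}$ such that $k_n = o(M_n)$ and

(12)    $B^2_{F_{\langle\phi\rangle}}(M_n) = o\big(\|\Gamma^{-1}_{F_{\langle\phi\rangle}}\|^{-1}\big)$.

Suppose further that the conditions of either Theorem 1 or Theorem 2 are satisfied.
Then, for any $b \in \mathbb{R}^p$,

(13)    $E\Big[\sup_{I \in \mathcal{I}} \Big| W\Big(\frac{b^T (G(F) - G(Y_{\langle\phi\rangle}))}{\sqrt{b^T \Gamma_{F_{\langle\phi\rangle}} b}} \in I \,\Big|\, Y\Big) - \psi(I) \Big|\Big] \to 0$    as $n \to \infty$.

Under the same conditions

(14)    $\sup_{I \in \mathcal{I}} \Big| P\Big(\frac{b^T (G(Y_{\langle\phi\rangle}) - G(F_{\langle\phi\rangle}))}{\sqrt{b^T \Gamma_{F_{\langle\phi\rangle}} b}} \in I\Big) - \psi(I) \Big| \to 0$    as $n \to \infty$.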



Note that $\sup_{I \in \mathcal{I}} |Q(I) - Q'(I)|$ is the Lévy-Prokhorov distance between
two distributions $Q$ and $Q'$ on $\mathbb{R}$. The Lévy-Prokhorov distance metrizes the
convergence in distribution. So, when $p = 1$ the Lévy-Prokhorov distance
between the distribution $W(dG(F)|Y)$ and $\mathcal{N}(G(Y_{\langle\phi\rangle}), \Gamma_{F_{\langle\phi\rangle}})$ goes to 0 in
mean, while $G(Y_{\langle\phi\rangle})$ goes to $\mathcal{N}(G(F_{\langle\phi\rangle}), \Gamma_{F_{\langle\phi\rangle}})$ in distribution.

An application of Theorem 3 is given in Section 5.2, in the context of the
regression on Fourier's basis. The proof is deferred to Section 6.3.



5. Applications. Here we give the three applications described in Sec- 
tion 2. The models studied and the collections of regressors used have already 
been defined there. 



5.1. The Gaussian sequence model. We consider the model (2). Here the
MLE is the projection $\theta_Y = (Y_j)_{1 \le j \le k_n}$.

The nonparametric case corresponds to the estimation of $\theta^0$. Under the
assumption that $\theta^0$ is in some regularity class, we will obtain a Bernstein-von
Mises theorem with the posterior convergence rate already obtained
in previous works, in particular, Ghosal and van der Vaart (2007). On the
other hand, for some priors known to achieve this rate, it will be seen that
the centering point and the asymptotic variance of the posterior distribution
do not fit with the ones expected in a Bernstein-von Mises theorem.
We also look at the semiparametric estimation of the squared $\ell^2$ norm
of $\theta^0$.



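5.1.1. The nonparametric estimation of $\theta^0$.

Proposition 1. Suppose that $\sum_{j=1}^{k_n} (\theta_j^0)^2$ is bounded. This holds when $\theta^0$
is an element of $\ell^2(\mathbb{N})$ not depending on $n$. With a prior $W = \mathcal{N}(0, \tau_n^2 I_{k_n})$
such that $n^{-1/4} = o(\tau_n)$, we have for any sequence $k_n \le n$,

$E\,\Big\|W(d\theta|Y) - \mathcal{N}\Big(\theta_Y, \frac{1}{n} I_{k_n}\Big)\Big\|_{TV} \to 0$    as $n \to \infty$,

and the convergence rate of $\theta$ toward $\theta_0$ is $\sqrt{k_n/n}$: for every $\lambda_n \to \infty$,

$E\Big[W\Big(\|\theta - \theta_0\| \ge \lambda_n \sqrt{\frac{k_n}{n}} \,\Big|\, Y\Big)\Big] \to 0$.

Recall that $\theta_0 = (\theta_j^0)_{1 \le j \le k_n}$ is the projection of $\theta^0$.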



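Proof of Proposition 1. The beginning is an immediate corollary
of Theorem 1. For the convergence rate, let $\lambda_n \to \infty$. Since
$\theta_Y - \theta_0 \sim \mathcal{N}(0, \frac{1}{n} I_{k_n})$,

$P\Big(\|\theta_Y - \theta_0\| \ge \frac{\lambda_n}{2} \sqrt{\frac{k_n}{n}}\Big) \to 0$.

In the same way, the mass that $\mathcal{N}(\theta_Y, \frac{1}{n} I_{k_n})$ puts on the set
$\{\theta : \|\theta - \theta_Y\| \ge \frac{\lambda_n}{2} \sqrt{k_n/n}\}$ does not depend on $Y$ and is equal to

$\mathcal{N}\Big(0, \frac{1}{n} I_{k_n}\Big)\Big(\Big\{u : \|u\| \ge \frac{\lambda_n}{2} \sqrt{\frac{k_n}{n}}\Big\}\Big)$,

which goes to 0. Therefore, by the triangle inequality,

$E\Big[W\Big(\|\theta - \theta_0\| \ge \lambda_n \sqrt{\frac{k_n}{n}} \,\Big|\, Y\Big)\Big]
  \le E\,\Big\|W(d\theta|Y) - \mathcal{N}\Big(\theta_Y, \frac{1}{n} I_{k_n}\Big)\Big\|_{TV}
  + \mathcal{N}\Big(0, \frac{1}{n} I_{k_n}\Big)\Big(\Big\{u : \|u\| \ge \frac{\lambda_n}{2} \sqrt{\frac{k_n}{n}}\Big\}\Big)
  + P\Big(\|\theta_Y - \theta_0\| \ge \frac{\lambda_n}{2} \sqrt{\frac{k_n}{n}}\Big) \to 0$. □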



However, in such a general setting we have no information about the bias
between $\theta^0$ and its projection $\theta_0$. Several authors add the assumption that
the true parameter belongs to a Sobolev class of regularity $\alpha > 0$, defined by
the relation $\sum_{j=1}^{\infty} |\theta_j^0|^2 j^{2\alpha} < \infty$. In this setting we show that for some priors
the induced posterior may achieve the nonparametric convergence rate but
with a centering point and a variance different from what is expected in the
Bernstein-von Mises theorem. Then we exhibit priors for which both the
Bernstein-von Mises theorem and the nonparametric convergence rate hold.

From now on, we suppose that $\sum_{j=1}^{\infty} |\theta_j^0|^2 j^{2\alpha} < \infty$. In this setting Ghosal
and van der Vaart (2007), Section 7.6, consider a prior $W$ such that $\theta_1$,
$\theta_2, \ldots$ are independent, and $\theta_j$ is normally distributed with variance $\sigma_{j,k_n}^2$.
Further, the variances are supposed to satisfy

(15)    $c/k_n \le \min\{\sigma_{j,k_n}^2 j^{2\alpha} : 1 \le j \le k_n\} \le C/k_n$

for some positive constants $c$ and $C$. Suppose that $\alpha > 1/2$ and there exist
constants $C_1$ and $C_2$ such that $C_1 n^{1/(1+2\alpha)} \le k_n \le C_2 n^{1/(1+2\alpha)}$. Then Ghosal
and van der Vaart (2007), Theorem 11, proved that the posterior converges
at the rate $n^{-\alpha/(1+2\alpha)}$.

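In order to get $n^{-1} I_{k_n}$ as asymptotic variance, we need more stringent
conditions on $k_n$, or a flatter prior. To see this is necessary, consider, for
$k_n \approx n^{1/(1+2\alpha)}$, the following choice for $\sigma_{j,k_n}$:

$\sigma_{j,k_n}^2 = k_n^{-1}$ if $1 \le j \le k_n/2$;    $\sigma_{j,k_n}^2 = 2^{2\alpha}/n$ if $j > k_n/2$.

Then $\min\{\sigma_{j,k_n}^2 j^{2\alpha} : 1 \le j \le k_n\} \approx k_n^{-1}$, and the posterior converges at the
rate $n^{-\alpha/(1+2\alpha)}$.

For this case we can explicitly calculate the posterior distribution. This is
similar to the calculation made in the proof of Theorem 1. The coordinates
are independent, and

$W(d\theta_j|Y) = \mathcal{N}\Big(\frac{\sigma_{j,k_n}^2}{\sigma_n^2 + \sigma_{j,k_n}^2}\, Y_j,\ \frac{\sigma_n^2\, \sigma_{j,k_n}^2}{\sigma_n^2 + \sigma_{j,k_n}^2}\Big)$.

For $j > k_n/2$, $\frac{\sigma_{j,k_n}^2}{\sigma_n^2 + \sigma_{j,k_n}^2} = \frac{2^{2\alpha}}{1 + 2^{2\alpha}}$ and, therefore, $\|W(d\theta_j|Y) - \mathcal{N}(Y_j, \sigma_n^2)\|_{TV}$ is
bounded away from 0.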
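
The failure described in this example can be checked directly from the conjugate formula for a Gaussian prior and Gaussian noise. The sketch below takes the two-regime prior variances to be $k_n^{-1}$ for $j \le k_n/2$ and $2^{2\alpha}/n$ for $j > k_n/2$ (our reading of the example), with noise variance $\sigma_n^2 = 1/n$, and computes the factor by which the posterior mean shrinks $Y_j$. The function names are hypothetical.

```python
def posterior_shrinkage(prior_var, noise_var):
    """Factor multiplying Y_j in the conjugate Gaussian posterior mean."""
    return prior_var / (prior_var + noise_var)


def counterexample_prior_var(j, k, n, alpha):
    """Two-regime prior variance: 1/k for j <= k/2, 2^(2*alpha)/n beyond."""
    return 1.0 / k if j <= k / 2 else 2.0 ** (2 * alpha) / n


if __name__ == "__main__":
    alpha = 1.0
    for n in (10 ** 3, 10 ** 6, 10 ** 9):
        k = round(n ** (1.0 / (1.0 + 2.0 * alpha)))
        low = posterior_shrinkage(counterexample_prior_var(1, k, n, alpha), 1.0 / n)
        high = posterior_shrinkage(counterexample_prior_var(k, k, n, alpha), 1.0 / n)
        # low tends to 1; high stays at 2^(2 alpha) / (1 + 2^(2 alpha)).
        print(n, k, low, high)
```

On the low coordinates the factor tends to 1, so the posterior is centered at $Y_j$ in the limit. On the high coordinates it stays at $2^{2\alpha}/(1 + 2^{2\alpha})$ whatever $n$ is, so neither the centering nor the variance matches $\mathcal{N}(Y_j, \sigma_n^2)$.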

By contrast, with an isotropic and flat prior we obtain the centering point
and the asymptotic variance we expected, and the same convergence rate as
previously. We have the following:

Proposition 2. Suppose that $\theta^0$ belongs to the Sobolev class of regularity
$\alpha > 0$. Choose a prior $W = \mathcal{N}(0, \tau_n^2 I_{k_n})$ such that $n^{-1/4} = o(\tau_n)$, which
ensures the asymptotic normality of the posterior distribution as in
Proposition 1.

If further $k_n \approx n^{1/(1+2\alpha)}$, then the convergence rate of $\theta$ toward $\theta_0$ and
toward $\theta^0$ is $n^{-\alpha/(1+2\alpha)}$: for every $\lambda_n \to \infty$,

$E\big[W\big(\|\theta - \theta^0\| \ge \lambda_n n^{-\alpha/(1+2\alpha)} \,\big|\, Y\big)\big] \to 0$.

Proof. We consider $\theta$ and $\theta_0$ as elements of $\ell^2(\mathbb{N})$ by setting
$\theta_j = \theta_{0,j} = 0$ for $j \ge k_n + 1$. The convergence rate toward $\theta_0$ has already been
established in Proposition 1. Since $\theta_{0,j} = \theta_j^0$ for $1 \le j \le k_n$,

$\|\theta^0 - \theta_0\| \le k_n^{-\alpha} \sqrt{\sum_{j=k_n+1}^{\infty} (\theta_j^0)^2 j^{2\alpha}} = O(k_n^{-\alpha})$.

Therefore, the convergence rate of $\theta$ toward $\theta^0$ is also $n^{-\alpha/(1+2\alpha)}$. □
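
The choice $k_n \approx n^{1/(1+2\alpha)}$ balances the two terms met in the proof: a squared posterior spread of order $k_n/n$ and a squared bias of order $k_n^{-2\alpha}$. The sketch below minimizes the proxy $k/n + k^{-2\alpha}$ over integer cutoffs. Setting both constants to 1 is a simplification made here, and the function name is hypothetical.

```python
def best_cutoff(n, alpha):
    """Integer k in [1, n] minimising the proxy risk k/n + k^(-2*alpha)."""
    return min(range(1, n + 1), key=lambda k: k / n + k ** (-2.0 * alpha))


if __name__ == "__main__":
    alpha = 1.0
    for n in (10 ** 3, 10 ** 4, 10 ** 5):
        k_star = best_cutoff(n, alpha)
        # The ratio to n^(1/(1+2 alpha)) should stay roughly constant in n.
        print(n, k_star, k_star / n ** (1.0 / (1.0 + 2.0 * alpha)))
```

The continuous minimizer of the proxy is $(2\alpha n)^{1/(1+2\alpha)}$, so the ratio printed in the last column stays near a constant, consistent with the order $n^{1/(1+2\alpha)}$ used in Proposition 2.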

5.1.2. Semiparametric theorem for the $\ell^2$ norm of $\theta^0$. We consider the
prior distribution used in Proposition 2, but now we look at the posterior
distribution of $\|\theta\|^2$. To get asymptotic normality with variance $n^{-1/2}$, we
just need $k_n = o(\sqrt{n})$. To control the bias term, we need $\alpha > 1/2$, and in
this case we get an adaptive Bayesian estimator.
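Proposition 3. Let $\alpha > 1/2$ and suppose that $\theta^0$ belongs to the Sobolev
class of regularity $\alpha$. Choose a prior $W = \mathcal{N}(0, \tau_n^2 I_{k_n})$ such that
$n^{-1/4} = o(\tau_n)$. Then, for any choice of $k_n$ such that $k_n = o(\sqrt{n})$ and $\sqrt{n} = o(k_n^{2\alpha})$,

$E\Big[\sup_{I \in \mathcal{I}} \Big| W\Big(\frac{\sqrt{n}\,(\|\theta\|^2 - \|\theta_Y\|^2)}{2\|\theta^0\|} \in I \,\Big|\, Y\Big) - \psi(I) \Big|\Big] \to 0$    as $n \to \infty$

and

$\sup_{I \in \mathcal{I}} \Big| P\Big(\frac{\sqrt{n}\,(\|\theta_Y\|^2 - \|\theta_0\|^2)}{2\|\theta^0\|} \in I\Big) - \psi(I) \Big| \to 0$    as $n \to \infty$.

Further, the bias is negligible with respect to the square root of the variance:

$\frac{\sqrt{n}\,(\|\theta_0\|^2 - \|\theta^0\|^2)}{2\|\theta^0\|} = o(1)$.

In particular, the choice $k_n = \sqrt{n}/\ln n$ is adaptive in $\alpha$.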

Proof. We set up an application of Theorem 3. Since $\sigma_n = n^{-1/2}$, the
conditions of Theorem 1 are fulfilled.

Here $G(\theta) = \theta^T\theta$, $\dot{G}_{\theta} = 2\theta^T$ and $D^2_{\theta} G = 2 I_{k_n}$. Therefore, $B_{\theta_0}(M_n) = 2M_n/n$,
while $\Gamma_{\theta_0} = 4\|\theta_0\|^2/n$.

Let us choose $(M_n)_{n \ge 1}$ such that $k_n = o(M_n)$ and $M_n = o(\sqrt{n})$. Such
sequences exist and fulfill the conditions of Theorem 3.

Since $\|\theta_0\|^2 \to \|\theta^0\|^2$, we can substitute the variance $\Gamma_{\theta_0}$ by $4\|\theta^0\|^2/n$ and
get the two asymptotic normality results, (13) and (14).

As $n \to \infty$, $\|\theta^0\|^2 - \|\theta_0\|^2 = \|\theta^0 - \theta_0\|^2 = O(k_n^{-2\alpha})$, as in the proof of
Proposition 2. If $\sqrt{n} = o(k_n^{2\alpha})$, we get $\sqrt{n}\,(\|\theta_0\|^2 - \|\theta^0\|^2) = o(1)$. □
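
The adaptivity claim for $k_n = \sqrt{n}/\ln n$ amounts to two limits that hold for every $\alpha > 1/2$: $k_n/\sqrt{n} \to 0$ and $\sqrt{n}/k_n^{2\alpha} \to 0$. The sketch below evaluates both ratios numerically. It treats $k_n$ as a real number rather than rounding it to an integer, which is a simplification made here, and the function name is hypothetical.

```python
import math


def adaptivity_ratios(n, alpha):
    """Return (k_n / sqrt(n), sqrt(n) / k_n^(2*alpha)) for k_n = sqrt(n) / ln(n)."""
    k = math.sqrt(n) / math.log(n)
    return k / math.sqrt(n), math.sqrt(n) / k ** (2.0 * alpha)


if __name__ == "__main__":
    for alpha in (0.6, 0.75, 2.0):
        for n in (1e6, 1e12, 1e24):
            print(alpha, n, adaptivity_ratios(n, alpha))
```

The first ratio equals $1/\ln n$ and the second equals $(\ln n)^{2\alpha}/n^{\alpha - 1/2}$. Both vanish without $\alpha$ entering the definition of $k_n$, although the second decays slowly when $\alpha$ is close to $1/2$.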



5.2. Regression on Fourier's basis. Now we consider the regression mo- 
del (3) with a function / in a Sobolev class W(a,L), and use Fourier's 
basis (4). For any 6 G R kn , we define fg = Yl^i^jfj- We also denote by 
9°£t 2 (N) the sequence of Fourier's coefficients of /: / — i Oj <Pj ■ 

The following useful lemma about our collection of regressors can be 
found, for instance, in Tsybakov (2004) (we slightly modified it to take 
into account the case n even): 

Lemma 1. Suppose either that $n$ is odd and $k_n \le n$, or $n$ is even and
$k_n \le n - 1$. Consider the collection $(\varphi_j)_{1 \le j \le k_n}$ defined before, and $\Phi$ the
associated matrix. Then $\Phi^T\Phi = n I_{k_n}$.

This makes the regression on Fourier's basis very close to the Gaussian 
sequence model, and the results we obtain are similar. 
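
The closeness to the Gaussian sequence model comes from the discrete orthogonality of the trigonometric system on a regular grid. The following sketch is an illustration under stated assumptions: it takes Fourier's basis (4) to be the usual trigonometric basis ($\varphi_1 = 1$, $\varphi_{2m} = \sqrt{2}\cos(2\pi m x)$, $\varphi_{2m+1} = \sqrt{2}\sin(2\pi m x)$) and the design points to be $x_i = i/n$. Under these assumptions it checks numerically that $\Phi^T\Phi = n I_{k_n}$ in the two cases covered by Lemma 1, and that the identity can fail when $n$ is even and $k_n = n$. All function names are hypothetical.

```python
import math

def fourier_design(n: int, k: int) -> list[list[float]]:
    """n x k design matrix of the trigonometric basis on the grid x_i = i/n.

    Assumed basis: phi_1 = 1, phi_{2m} = sqrt(2) cos(2 pi m x),
    phi_{2m+1} = sqrt(2) sin(2 pi m x).
    """
    def phi(j: int, x: float) -> float:
        if j == 1:
            return 1.0
        m = j // 2
        trig = math.cos if j % 2 == 0 else math.sin
        return math.sqrt(2.0) * trig(2.0 * math.pi * m * x)

    return [[phi(j, i / n) for j in range(1, k + 1)] for i in range(1, n + 1)]

def gram(design: list[list[float]]) -> list[list[float]]:
    """Compute Phi^T Phi by direct summation."""
    k = len(design[0])
    return [[sum(row[a] * row[b] for row in design) for b in range(k)]
            for a in range(k)]

def is_n_times_identity(g: list[list[float]], n: int, tol: float = 1e-9) -> bool:
    """True when every entry of g is within tol of the matching entry of n * I."""
    return all(abs(g[a][b] - (n if a == b else 0.0)) < tol
               for a in range(len(g)) for b in range(len(g)))

print(is_n_times_identity(gram(fourier_design(7, 7)), 7))  # n odd, k_n = n
print(is_n_times_identity(gram(fourier_design(8, 7)), 8))  # n even, k_n = n - 1
print(is_n_times_identity(gram(fourier_design(8, 8)), 8))  # n even, k_n = n
```

In the last case the highest-frequency cosine takes the values $\pm\sqrt{2}$ on the grid, so its squared norm is $2n$ instead of $n$, which is why the even case needs $k_n \le n - 1$.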

In this subsection we first consider the estimation of $f$ in a Sobolev class,
for which we get a Bernstein-von Mises theorem and the frequentist minimax
$n^{-\alpha/(1+2\alpha)}$ posterior convergence rate for the $L^2$ norm. Then we consider two
semiparametric settings: the estimation of a linear functional of $f$, and the
estimation of the $L^2$ norm of $f$. We get the adaptive $\sqrt{n}$ convergence rate
for any $\alpha > 1/2$.

5.2.1. Nonparametric Bernstein-von Mises theorem in Sobolev classes. 

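Proposition 4. Suppose that $f$ belongs to some Sobolev class $W(\alpha, L)$
for $L > 0$ and $\alpha > 1/2$. Let $k_n \approx n^{1/(1+2\alpha)}$ and $W = \mathcal{N}(0, \gamma_n I_{k_n})$ be the prior
on $\theta$, for a sequence $(\gamma_n)_{n \ge 1}$ such that $1/\sqrt{n} = o(\gamma_n)$. Then

\[
E\left\| W(d\theta \mid Y) - \mathcal{N}\left(\theta_Y, \frac{\sigma_n^2}{n} I_{k_n}\right) \right\|_{\mathrm{TV}} \to 0 \quad \text{as } n \to \infty
\]

and the convergence rate relative to the Euclidean norm for $f_\theta$ is $n^{-\alpha/(1+2\alpha)}$:
for every $\lambda_n \to \infty$,

\[
E\left[ W\left( \|f_\theta - f\| \ge \lambda_n n^{-\alpha/(1+2\alpha)} \mid Y \right) \right] \to 0.
\]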

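Proof. The conditions of Theorem 1 are fulfilled: with $\tau_n^2 = n\gamma_n$, we
have $n = o(\tau_n^4)$. The first assertion follows.

Because of the orthogonal nature of Fourier's basis, $\|f_\theta - f\| = \|\theta - \theta^0\|$
in $\ell^2(\mathbb{N})$. We use the decomposition $\|\theta - \theta^0\|^2 \le \|\theta - \theta_0\|^2 + \|\theta_0 - \theta^0\|^2$. In
the same way as in the proof of Proposition 1, for any $\lambda_n \to \infty$,

\[
E\left[ W\left( \|\theta - \theta_0\| \ge \lambda_n \sqrt{\frac{k_n}{n}} \,\Big|\, Y \right) \right] \to 0.
\]

Going back to Definition 1, we have

\[
\sum_{j=k_n+1}^{\infty} (\theta_j^0)^2 \le k_n^{-2\alpha} \sum_{j=k_n+1}^{\infty} a_j^2 (\theta_j^0)^2 = O(k_n^{-2\alpha}).
\]

This allows us to get

\[
E\left[ W\left( \|\theta - \theta^0\| \ge \lambda_n n^{-\alpha/(1+2\alpha)} \mid Y \right) \right] \to 0.
\]

□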



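5.2.2. Linear functionals of $f$. Let $g : [0, 1] \to \mathbb{R}$ be a function in $L^2([0, 1])$.
We want to estimate $\mathcal{F}(f) = \int_0^1 fg$, and we approximate it by

\[
\frac{1}{n} \sum_{i=1}^{n} g(i/n) f(i/n) = G F_0,
\]

where $G = (g(i/n)/n)_{1 \le i \le n}$. The plug-in MLE estimator of $GF_0$ in the misspecified
model $\langle\phi\rangle$ is $GY_{\langle\phi\rangle}$. More generally, we consider the functional
$F \mapsto GF$. The following result is adaptive, in the sense that the same choice
$k_n = \lfloor n/\ln n \rfloor$ entails the convergence rate $n^{-1/2}$ for all values of $\alpha > 1/2$.

Proposition 5. Suppose $f$ is bounded, and let $W$ be the prior induced
by the $\mathcal{N}(0, \gamma_n I_{k_n})$ distribution on $\theta$, for a sequence $(\gamma_n)_{n \ge 1}$ such that
$1/\sqrt{n} = o(\gamma_n)$. Then:

(1)

\[
E\left\| W(d(GF) \mid Y) - \mathcal{N}(GY_{\langle\phi\rangle}, \sigma_n^2 G\Sigma G^T) \right\|_{\mathrm{TV}} \to 0
\]

and the distribution of $GY_{\langle\phi\rangle}$ is $\mathcal{N}(GF_{\langle\phi\rangle}, \sigma_n^2 G\Sigma G^T)$.

(2) Suppose further that $f$ and $g$ belong to some Sobolev class $W(\alpha, L)$
for $L > 0$ and $\alpha > 1/2$. Then $G\Sigma G^T \sim \frac{1}{n}\int_0^1 g^2$,

\[
E\left\| W\left( d\left( \frac{\sqrt{n}\,(GF - GY_{\langle\phi\rangle})}{\sigma_n \sqrt{\int_0^1 g^2}} \right) \Big|\, Y \right) - \mathcal{N}(0,1) \right\|_{\mathrm{TV}} \to 0 \quad \text{as } n \to \infty
\]

and

\[
\frac{\sqrt{n}\,(GY_{\langle\phi\rangle} - GF_{\langle\phi\rangle})}{\sigma_n \sqrt{\int_0^1 g^2}} \to \mathcal{N}(0,1)
\]

in distribution, as $n \to \infty$.

(3) Suppose that $f$ and $g$ belong to some Sobolev class $W(\alpha, L)$ for $L > 0$
and $\alpha > 1/2$, and suppose further that $k_n$ is large enough so that $n = o(k_n^{2\alpha})$.
Then the bias is negligible with respect to the square root of the variance:

\[
\frac{\sqrt{n}\,(GF_{\langle\phi\rangle} - \mathcal{F}(f))}{\sigma_n} = o(1).
\]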



Before the proof we give two lemmas, proved in Appendix B in the supplemental
article [Bontemps (2011)], about the error terms of the approximation
of a Sobolev class by a sieve built on Fourier's basis, and of the
approximation of an integral by a Riemann sum.

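Lemma 2. Let $\alpha > 1/2$ and $L > 0$. We suppose $n$ odd or $k_n < n$. If
$f \in W(\alpha, L)$,

\[
\|F_0 - F_{\langle\phi\rangle}\| \le (1 + o(1)) \frac{\sqrt{2}\, L \sqrt{n}}{\pi^{\alpha} k_n^{\alpha}}.
\]

Further, $\|F_0\| \sim \sqrt{n \int_0^1 f^2}$ and $\|F_0 - F_{\langle\phi\rangle}\| = O(k_n^{-\alpha} \|F_0\|)$.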

Lemma 3. Let two functions $f \in W(\alpha, L)$ and $g \in W(\alpha', L')$ for some
$\alpha, \alpha' > 1/2$ and two positive numbers $L$ and $L'$. Then

\[
\frac{1}{n} \sum_{i=1}^{n} f(i/n) g(i/n) - \int_0^1 fg = O\left(n^{-\inf(\alpha, \alpha')}\right).
\]



Proof of Proposition 5. (1) The first assertion is just Corollary 1.
The conditions of Theorem 1 are fulfilled, as in the proof of Proposition 4.

(2) If $g \in W(\alpha, L)$ for $L > 0$ and $\alpha > 1/2$, $G\Sigma G^T = \|\Sigma G^T\|^2 \sim \|G^T\|^2$ by
Lemma 2. In the meantime $\|G^T\|^2 = \frac{1}{n^2} \sum_{i=1}^{n} g^2(x_i) \sim \frac{1}{n} \int_0^1 g^2$ by Lemma 3.
So $G\Sigma G^T \sim \frac{1}{n} \int_0^1 g^2$, and the variance in the formulas of Corollary 1 can be
substituted with $\frac{\sigma_n^2}{n} \int_0^1 g^2$.

(3) We decompose the bias into two terms, $|GF_0 - \mathcal{F}(f)|$ and $|GF_{\langle\phi\rangle} - GF_0|$,
and show that both are $o(n^{-1/2})$. The first term is controlled by
Lemma 3. For the last one, $|GF_{\langle\phi\rangle} - GF_0| \le \|G^T\| \, \|F_{\langle\phi\rangle} - F_0\|$. But $\|G^T\| = O(n^{-1/2})$,
$\|F_{\langle\phi\rangle} - F_0\| = O(k_n^{-\alpha} \|F_0\|)$ by Lemma 2 and $\|F_0\| = O(\sqrt{n})$. We
conclude thanks to the assumption $n = o(k_n^{2\alpha})$. □

5.2.3. $L^2$ norm of $f$. Suppose that we want to estimate $\mathcal{F}(f) = \int_0^1 f^2$.
We can consider the plug-in MLE estimator

\[
\frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{k_n} (\theta_Y)_j \, \varphi_j(i/n) \right)^2.
\]

More generally, we define, for any $F \in \mathbb{R}^n$,

\[
G(F) = \frac{1}{n} \|F\|^2. \tag{16}
\]

With a Gaussian prior, we obtain the following result, which is also adaptive:
the same $k_n = \lfloor \sqrt{n}/\ln n \rfloor$ is suitable whatever $\alpha > 1/2$.

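Proposition 6. Let $G(F) = \|F\|^2/n$. Suppose that $f \in W(\alpha, L)$ for
some $L > 0$ and $\alpha > 1/2$. Let $W$ be the prior induced by the $\mathcal{N}(0, \gamma_n I_{k_n})$
distribution on $\theta$, for a sequence $(\gamma_n)_{n \ge 1}$ such that $1/\sqrt{n} = o(\gamma_n)$. The
sequence $(k_n)_{n \ge 1}$ can be chosen such that $k_n = o(\sqrt{n})$ and $\sqrt{n} = o(k_n^{2\alpha})$, and
with such a choice,

\[
E\left[\sup_{I \in \mathcal{I}} \left| W\left( \frac{\sqrt{n}\,(G(F) - G(Y_{\langle\phi\rangle}))}{2\sigma_n \sqrt{\mathcal{F}(f)}} \in I \,\Big|\, Y \right) - \psi(I) \right|\right] \to 0 \quad \text{as } n \to \infty
\]

and

\[
\frac{\sqrt{n}\,(G(Y_{\langle\phi\rangle}) - G(F_{\langle\phi\rangle}))}{2\sigma_n \sqrt{\mathcal{F}(f)}} \to \mathcal{N}(0,1)
\]

in distribution, as $n \to \infty$. Further, the
bias is negligible with respect to the square root of the variance:

\[
\frac{\sqrt{n}\,(G(F_{\langle\phi\rangle}) - \mathcal{F}(f))}{\sigma_n} = o(1).
\]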

A similar corollary could be stated for a non-Gaussian prior. 

Proof of Proposition 6. First, let us note that the conditions of
Theorem 1 are fulfilled, as in the proof of Proposition 4. Lemma 10 in
Appendix B ensures that $f$ is bounded.

In this setting $\dot G_F = (2/n) F^T$ and $D_F^2 G(h, h) = (2/n)\|h\|^2$ for any $F \in \mathbb{R}^n$
and any $h \in \mathbb{R}^n$. Therefore, $B_F(a) = 2\sigma_n^2 a/n$, and $\Gamma_F = 4(\sigma_n^2/n^2)\|F\|^2$. By
Lemma 2, $\|F_{\langle\phi\rangle}\|^2 \sim \|F_0\|^2 \sim n\mathcal{F}(f)$. Thus, $\Gamma_{F_{\langle\phi\rangle}} = 4(1 + o(1))\sigma_n^2 \mathcal{F}(f)/n$.

Let us choose $(M_n)_{n \ge 1}$ such that $k_n = o(M_n)$ and $M_n = o(\sqrt{n})$. Such
sequences exist and fulfill the conditions of Theorem 3. We can substitute
the variance $\Gamma_{F_{\langle\phi\rangle}}$ by $4\sigma_n^2 \mathcal{F}(f)/n$ and get the two asymptotic normality results.

Let us now consider the bias term:

\[
|\mathcal{F}(f) - G(F_{\langle\phi\rangle})| \le \frac{\big| \|F_0\|^2 - \|F_{\langle\phi\rangle}\|^2 \big|}{n} + \left| \int_0^1 f^2 - \frac{1}{n} \sum_{i=1}^{n} f^2(i/n) \right|.
\]

We use Lemma 2 to control $\|F_0\|^2 - \|F_{\langle\phi\rangle}\|^2$, and Lemma 3 for the other
term:

\[
|\mathcal{F}(f) - G(F_{\langle\phi\rangle})| = O(k_n^{-2\alpha}) + O(n^{-\alpha}).
\]

This is $o(1/\sqrt{n})$ under the assumptions of Proposition 6. □
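
Because $G(F) = \|F\|^2/n$ is quadratic, its second-order Taylor expansion is exact, which is what makes the derivative bounds in the proof so simple. The sketch below checks this on a small hypothetical vector; the function names and the numerical values are assumptions made for illustration only.

```python
def G(F: list[float]) -> float:
    """Quadratic functional G(F) = ||F||^2 / n."""
    return sum(x * x for x in F) / len(F)

def G_dot(F: list[float], h: list[float]) -> float:
    """First derivative of G at F applied to h: (2/n) F^T h."""
    return 2.0 * sum(x * y for x, y in zip(F, h)) / len(F)

def second_order_term(h: list[float]) -> float:
    """Half of D^2 G(h, h) = (2/n) ||h||^2, that is ||h||^2 / n."""
    return sum(y * y for y in h) / len(h)

# Hypothetical point F and perturbation h.
F = [1.0, -2.0, 0.5, 3.0]
h = [0.1, 0.2, -0.3, 0.05]
shifted = [x + y for x, y in zip(F, h)]

# Remainder after removing the linear term; it should equal ||h||^2 / n.
remainder = G(shifted) - G(F) - G_dot(F, h)
print(remainder, second_order_term(h))
```

The remainder does not depend on the base point $F$, consistent with a second derivative that is constant in $F$.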

5.3. Regression on splines. Here we consider the regression model for
functions in $C^\alpha[0, 1]$ with $\alpha > 0$, using splines, set up in Section 2. We first
develop further the framework and the assumptions used here, and recall
the previous result of Ghosal and van der Vaart (2007), Section 7.7.1, which
obtains the posterior concentration at the frequentist minimax rate. Then
we present two Bernstein-von Mises theorems: the first one with the same
prior as Ghosal and van der Vaart (2007) but a stronger condition on $k_n$ (or
equivalently on $\alpha$); the second one with a flatter prior, for which we obtain
the minimax convergence rate in addition to the asymptotic Gaussianity of
the posterior distribution.

To see this, we begin with some preliminaries. For any $\theta \in \mathbb{R}^{k_n}$, define
$f_\theta = \sum_{j=1}^{k_n} \theta_j B_j$. The B-splines basis has the following approximation property:
for any $\alpha > 0$, there exists $C_\alpha > 0$ such that, if $f \in C^\alpha[0, 1]$, there exists
$\theta^\infty \in \mathbb{R}^{k_n}$ satisfying

\[
\|f - f_{\theta^\infty}\|_\infty \le C_\alpha k_n^{-\alpha} \|f\|_\alpha. \tag{17}
\]

We need the design $(x_i^{(n)})_{n \ge 1,\, 1 \le i \le n}$ to be sufficiently regular and, as
stressed in Ghosal and van der Vaart (2007), the spatial separation property
of B-splines permits us to express the precise condition in terms of the
covariance matrix $\Phi^T\Phi$. We suppose that there exist positive constants $C_1$
and $C_2$ such that, as $n$ increases, for any $\theta \in \mathbb{R}^{k_n}$,

\[
C_1 \frac{n}{k_n} \|\theta\|^2 \le \theta^T \Phi^T \Phi \theta \le C_2 \frac{n}{k_n} \|\theta\|^2. \tag{18}
\]

Let us associate the norm $\|f\|_n = \sqrt{\frac{1}{n} \sum_{i=1}^{n} |f(x_i)|^2}$ to the design. Note
that $\sqrt{n}\,\|f_\theta\|_n = \|\Phi\theta\|$ if $\theta \in \mathbb{R}^{k_n}$. Under (18) we have a relation between
$\|\cdot\|_n$ and the Euclidean norm on the parameter space: for every $\theta_1$ and $\theta_2$,

\[
\sqrt{C_1}\, \|\theta_1 - \theta_2\| \le \sqrt{k_n}\, \|f_{\theta_1} - f_{\theta_2}\|_n \le \sqrt{C_2}\, \|\theta_1 - \theta_2\|.
\]
With these conditions Ghosal and van der Vaart (2007), Theorem 12, get
the posterior concentration at the minimax rate. Take $\alpha > 1/2$, let $W = \mathcal{N}(0, I_{k_n})$
be the prior on the spline coefficients, and suppose there exist
constants $C_3$ and $C_4$ such that $C_3 n^{1/(1+2\alpha)} \le k_n \le C_4 n^{1/(1+2\alpha)}$. Then the
posterior concentrates at the minimax rate $n^{-\alpha/(1+2\alpha)}$ relative to $\|\cdot\|_n$: for
every $\lambda_n \to \infty$,

\[
E\left[ W\left( \|f_\theta - f\|_n \ge \lambda_n n^{-\alpha/(1+2\alpha)} \mid Y \right) \right] \to 0.
\]

This is equivalent to a convergence rate $n^{(1-2\alpha)/(2(1+2\alpha))}$ relative to the
Euclidean norm for $\theta$:

\[
E\left[ W\left( \|\theta - \theta_0\| \ge \lambda_n n^{(1-2\alpha)/(2(1+2\alpha))} \mid Y \right) \right] \to 0.
\]

Indeed, (17) and the projection property imply

\[
\|f_{\theta_0} - f\|_n \le \|f_{\theta^\infty} - f\|_n \le \|f_{\theta^\infty} - f\|_\infty \le C_\alpha \|f\|_\alpha k_n^{-\alpha}.
\]

Now, with modified assumptions we get the Bernstein-von Mises theorem
in two different settings. First, with the same prior as Ghosal and van der
Vaart (2007):

Proposition 7. Assume that $f$ is bounded, $k_n = o((n/\ln n)^{1/3})$ and (18)
holds. Let $W = \mathcal{N}(0, I_{k_n})$ be the prior on the spline coefficients. Then

\[
E\left\| W(d\theta \mid Y) - \mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) \right\|_{\mathrm{TV}} \to 0 \quad \text{as } n \to \infty \tag{19}
\]

and the convergence rate relative to the Euclidean norm for $\theta$ is $k_n/\sqrt{n}$.

Remarks. We need $\alpha > 1$ to get the Gaussian shape with the same convergence
rate as in Ghosal and van der Vaart (2007). The conditions of
Proposition 7 are satisfied, in particular, if there exist constants $C_3$ and $C_4$
such that $C_3 n^{1/(1+2\alpha)} \le k_n \le C_4 n^{1/(1+2\alpha)}$. In this case the convergence rate
for $\theta$ is $n^{(1-2\alpha)/(2(1+2\alpha))}$.
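
The exponents in these remarks follow from elementary arithmetic on powers of $n$, ignoring logarithmic factors. The sketch below does that bookkeeping with exact fractions; the helper names are hypothetical, and the code only compares exponents, it does not verify any asymptotic statement.

```python
from fractions import Fraction

def k_exponent(alpha: Fraction) -> Fraction:
    """Exponent of n in k_n when k_n is of order n^(1/(1+2 alpha))."""
    return 1 / (1 + 2 * alpha)

def theta_rate_exponent(alpha: Fraction) -> Fraction:
    """Exponent of n in the rate k_n / sqrt(n)."""
    return k_exponent(alpha) - Fraction(1, 2)

def cube_condition_holds(alpha: Fraction) -> bool:
    """k_n^3 = o(n / ln n) with k_n ~ n^(1/(1+2 alpha)) needs 3/(1+2 alpha) < 1."""
    return 3 * k_exponent(alpha) < 1

for alpha in (Fraction(1, 2), Fraction(1), Fraction(3, 2), Fraction(2)):
    closed_form = (1 - 2 * alpha) / (2 * (1 + 2 * alpha))
    print(alpha, theta_rate_exponent(alpha), closed_form, cube_condition_holds(alpha))
```

The condition $k_n = o((n/\ln n)^{1/3})$ therefore holds for $k_n$ of order $n^{1/(1+2\alpha)}$ exactly when $\alpha > 1$, and the boundary case $\alpha = 1$ fails because of the logarithmic factor.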

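Proof of Proposition 7. We set up an application of Theorem 2. We
can choose $M_n$ such that $k_n \ln n = o(M_n)$ and $M_n = o(n/k_n^2)$. Assumption (2)
is then trivially satisfied.

From (18) we get $\|\Phi^T\Phi\| \le C_2 \frac{n}{k_n}$ and $\|(\Phi^T\Phi)^{-1}\| \le C_1^{-1} \frac{k_n}{n}$. We have
also $\ln\det(\Phi^T\Phi) \le k_n \ln C_2 + k_n \ln\frac{n}{k_n} = O(k_n \ln n) = o(M_n)$. Since
$\theta_0 = (\Phi^T\Phi)^{-1}\Phi^T F_0$,

\[
\|\theta_0\|^2 \le \frac{k_n}{C_1 n} \|F_0\|^2 \le \frac{k_n \|f\|_\infty^2}{C_1}.
\]

Therefore, $-\ln w(\theta_0) = O(k_n) + \frac{1}{2}\|\theta_0\|^2 = O(k_n) = o(M_n)$, and assumption (3)
holds.

Let $h \in \mathbb{R}^{k_n}$ such that $\|\Phi h\|^2 \le \sigma_n^2 M_n$. We have
$\|h\|^2 \le \|(\Phi^T\Phi)^{-1}\| \, \|\Phi h\|^2 \le \frac{k_n \sigma_n^2 M_n}{C_1 n} = o(k_n^{-1})$. Therefore,

\[
\sup_{\|\Phi h\|^2 \le \sigma_n^2 M_n} \left| \ln \frac{w(\theta_0 + h)}{w(\theta_0)} \right| \le \sup_{\|\Phi h\|^2 \le \sigma_n^2 M_n} \frac{\|h\|^2 + 2|\theta_0^T h|}{2} = o(1) \tag{20}
\]

and assumption (1) follows.

Let us now prove the convergence rate. Let $\lambda_n \to \infty$. Then

\[
P\left( \|\theta_Y - \theta_0\| \ge \frac{\lambda_n k_n}{\sqrt{n}} \right) \le P\left( \|\Phi(\theta_Y - \theta_0)\|^2 \ge C_1 \lambda_n^2 k_n \right) \to 0,
\]

since $\|\Phi(\theta_Y - \theta_0)\|^2 \sim \sigma_n^2 \chi^2(k_n)$. In the same way

\[
E\left[ W\left( \|\theta - \theta_Y\| \ge \frac{\lambda_n k_n}{\sqrt{n}} \,\Big|\, Y \right) \right] \le E\left\| W(d\theta \mid Y) - \mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) \right\|_{\mathrm{TV}} + \mathcal{N}(0, \sigma_n^2 (\Phi^T\Phi)^{-1})\left( \left\{ h : \|h\| \ge \frac{\lambda_n k_n}{\sqrt{n}} \right\} \right) \to 0,
\]

where Theorem 2 controls the first term on the right. Therefore,

\[
E\left[ W\left( \|\theta - \theta_0\| \ge \frac{\lambda_n k_n}{\sqrt{n}} \,\Big|\, Y \right) \right] \to 0.
\]

Now, (19) is the same as Theorem 2 in terms of $W$. □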

The situation is similar to the one we encountered with the Gaussian
sequence model. To get the Bernstein-von Mises theorem with the same
convergence rate as Ghosal and van der Vaart (2007) for $\alpha \le 1$, we need
a flatter prior:

Proposition 8. Assume that $f$ is bounded and (18) holds. Let $W = \mathcal{N}(0, \tau_n^2 I_{k_n})$
be the prior on the spline coefficients, with the sequence $\tau_n$ satisfying

\[
\frac{k_n^2 \ln n}{n} = o(\tau_n^2) \quad \text{and} \quad \frac{k_n^3 \ln n}{n} = o(\tau_n^4).
\]

Then

\[
E\left\| W(d\theta \mid Y) - \mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) \right\|_{\mathrm{TV}} \to 0 \quad \text{as } n \to \infty
\]

and the convergence rate relative to the Euclidean norm for $\theta$ is $k_n/\sqrt{n}$.

When $\alpha > 0$ and $k_n$ is of order $n^{1/(1+2\alpha)}$, the conditions reduce to
$n^{(2-2\alpha)/(1+2\alpha)} \ln n = o(\tau_n^4)$. So we obtain the convergence rate of Ghosal and
van der Vaart (2007) in addition to the Gaussian shape with the same $k_n$,
even for $\alpha \le 1$, but with a different prior.



Proof of Proposition 8. The proof is essentially the same as for
Proposition 7. $M_n$ can be chosen so that $k_n \ln n = o(M_n)$, $M_n = o\left(\frac{n\tau_n^2}{k_n}\right)$, and
$M_n = o\left(\frac{n\tau_n^4}{k_n^2}\right)$. These last two conditions are the ones needed to obtain the
same upper bounds as in (20). □

6. Proofs. 

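6.1. Proof of Theorem 1. In the present setting all distributions are explicit
and admit known densities with respect to the corresponding Lebesgue
measure. We decompose any $y \in \mathbb{R}^n$ in two orthogonal components
$y = \Phi\theta_y + y'$, with $\Phi^T y' = 0$. Then

\[
dP_\theta(y) = c_1 \exp\left\{ -\frac{1}{2\sigma_n^2} \left( \|\Phi\theta\|^2 + \|\Phi\theta_y\|^2 + \|y'\|^2 - 2\theta^T \Phi^T \Phi \theta_y \right) \right\},
\]

\[
dW(\theta) = c_2 \exp\left\{ -\frac{1}{2\tau_n^2} \|\Phi\theta\|^2 \right\},
\]

\[
dP_\theta(y)\, dW(\theta) = c_1 c_2 \exp\left\{ -\frac{\sigma_n^2 + \tau_n^2}{2\sigma_n^2 \tau_n^2} \left\| \Phi\left( \theta - \frac{\tau_n^2}{\sigma_n^2 + \tau_n^2} \theta_y \right) \right\|^2 - \frac{1}{2(\sigma_n^2 + \tau_n^2)} \|\Phi\theta_y\|^2 - \frac{1}{2\sigma_n^2} \|y'\|^2 \right\},
\]

where $c_1 = (2\pi)^{-n/2} \sigma_n^{-n}$ and $c_2 = (2\pi)^{-k_n/2} \tau_n^{-k_n} \det(\Phi^T\Phi)^{1/2}$.

Using the Bayes rule, we get the density of $W(d\theta \mid Y)$, in which we recognize
the normal distribution

\[
W(d\theta \mid Y) = \mathcal{N}\left( \frac{\tau_n^2}{\sigma_n^2 + \tau_n^2} \theta_Y, \; \frac{\sigma_n^2 \tau_n^2}{\sigma_n^2 + \tau_n^2} (\Phi^T\Phi)^{-1} \right). \tag{21}
\]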
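
The completion of the square that leads to (21) can be sanity-checked numerically. The sketch below works in a scalar simplification, which is an assumption made for illustration: `a` stands for $\Phi\theta$, `b` for $\Phi\theta_y$, and the term in $\|y'\|^2$, which does not involve $\theta$, is dropped from both sides. The function names are hypothetical.

```python
import random

def joint_exponent(a: float, b: float, sigma: float, tau: float) -> float:
    """Exponent before completing the square, with a = Phi*theta and b = Phi*theta_y."""
    return -((a - b) ** 2) / (2 * sigma**2) - a**2 / (2 * tau**2)

def completed_square(a: float, b: float, sigma: float, tau: float) -> float:
    """Same exponent written around the centre (tau^2 / (sigma^2 + tau^2)) * b."""
    s2, t2 = sigma**2, tau**2
    shrink = t2 / (s2 + t2)
    return -(s2 + t2) / (2 * s2 * t2) * (a - shrink * b) ** 2 - b**2 / (2 * (s2 + t2))

# Compare the two expressions on random inputs (fixed seed for reproducibility).
rng = random.Random(0)
max_gap = 0.0
for _ in range(1000):
    a, b = rng.uniform(-5, 5), rng.uniform(-5, 5)
    sigma, tau = rng.uniform(0.1, 2), rng.uniform(0.1, 2)
    gap = abs(joint_exponent(a, b, sigma, tau) - completed_square(a, b, sigma, tau))
    max_gap = max(max_gap, gap)
print(max_gap)
```

The shrinkage factor $\tau_n^2/(\sigma_n^2 + \tau_n^2)$ and the variance factor $\sigma_n^2\tau_n^2/(\sigma_n^2 + \tau_n^2)$ read off from the completed square are the ones appearing in (21).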

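So we have an exact expression for $W(d\theta \mid Y)$, but the centering and the
variance do not correspond to the limit distribution given in Theorem 1.
Therefore, we make use of the triangle inequality, with intermediate
distribution $Q = \mathcal{N}\left( \frac{\tau_n^2}{\sigma_n^2 + \tau_n^2} \theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1} \right)$:

\[
\left\| W(d\theta \mid Y) - \mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) \right\|_{\mathrm{TV}} \le \left\| W(d\theta \mid Y) - Q \right\|_{\mathrm{TV}} + \left\| Q - \mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) \right\|_{\mathrm{TV}}. \tag{22}
\]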

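We first deal with the change in the variance, that is, the first term on
the right in (22).

Let $\alpha_n = \sqrt{\frac{\tau_n^2}{\sigma_n^2} \ln\left(1 + \frac{\sigma_n^2}{\tau_n^2}\right)}$, and $f$ and $g$ be, respectively, the density
functions of $\mathcal{N}(0, I_{k_n})$ and $\mathcal{N}\left(0, \frac{\tau_n^2}{\sigma_n^2 + \tau_n^2} I_{k_n}\right)$. Let $U$ be a random variable following
the chi-square distribution with $k_n$ degrees of freedom $\chi^2(k_n)$. Let $\sqrt{\Phi^T\Phi}$
be a square root of the matrix $\Phi^T\Phi$. The total variation norm is invariant
under the bijective affine map $\theta \mapsto \frac{1}{\sigma_n} \sqrt{\Phi^T\Phi} \left( \theta - \frac{\tau_n^2}{\sigma_n^2 + \tau_n^2} \theta_Y \right)$, so

\[
\begin{aligned}
\|W(d\theta \mid Y) - Q\|_{\mathrm{TV}}
&= \left\| \mathcal{N}\left(0, \frac{\tau_n^2}{\sigma_n^2 + \tau_n^2} I_{k_n}\right) - \mathcal{N}(0, I_{k_n}) \right\|_{\mathrm{TV}} = \int (g - f)_+ \\
&= \int_{\|x\| \le \sqrt{k_n}\,\alpha_n} (g(x) - f(x))\, d^{k_n}x \\
&= P\left( k_n \alpha_n^2 \le U \le \frac{\sigma_n^2 + \tau_n^2}{\tau_n^2} k_n \alpha_n^2 \right) \\
&= P\left( \sqrt{\frac{k_n}{2}}\,(\alpha_n^2 - 1) \le \frac{U - k_n}{\sqrt{2k_n}} \le \sqrt{\frac{k_n}{2}} \left( \frac{\sigma_n^2 + \tau_n^2}{\tau_n^2} \alpha_n^2 - 1 \right) \right).
\end{aligned}
\]

As $n$ goes to infinity, $\frac{U - k_n}{\sqrt{2k_n}}$ converges toward $\mathcal{N}(0, 1)$ in distribution. Using
the Taylor expansion of $\ln$, we find

\[
\alpha_n^2 = 1 - \frac{\sigma_n^2}{2\tau_n^2} + O\left( \frac{\sigma_n^4}{\tau_n^4} \right)
\]

and, therefore,

\[
\sqrt{\frac{k_n}{2}}\,(\alpha_n^2 - 1) \sim -\sqrt{\frac{k_n}{2}}\, \frac{\sigma_n^2}{2\tau_n^2}, \qquad
\sqrt{\frac{k_n}{2}} \left( \frac{\sigma_n^2 + \tau_n^2}{\tau_n^2} \alpha_n^2 - 1 \right) \sim \sqrt{\frac{k_n}{2}}\, \frac{\sigma_n^2}{2\tau_n^2}.
\]

Since $k_n = o(\tau_n^4/\sigma_n^4)$, both these quantities go to 0. As a consequence,
$\|W(d\theta \mid Y) - Q\|_{\mathrm{TV}}$ goes to zero as $n$ goes to infinity.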
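
The total variation distance between two centered Gaussian distributions whose covariances differ by the factor $\tau_n^2/(\sigma_n^2 + \tau_n^2)$ reduces to a difference of two chi-square probabilities, with thresholds $k_n\alpha_n^2$ and $(1 + \sigma_n^2/\tau_n^2)\,k_n\alpha_n^2$. The sketch below evaluates it exactly for an even number of degrees of freedom, where the chi-square distribution function has an elementary closed form. Restricting to even $k$ is a simplification made for this illustration, and the function names and parameter values are hypothetical.

```python
import math

def chi2_cdf_even(u: float, k: int) -> float:
    """CDF of the chi-square distribution with an even number k of degrees of freedom."""
    if k % 2 != 0:
        raise ValueError("closed form used here needs an even k")
    half = u / 2.0
    term, total = 1.0, 0.0
    for j in range(k // 2):
        total += term            # adds (u/2)^j / j!
        term *= half / (j + 1)
    return 1.0 - math.exp(-half) * total

def variance_tv(k: int, sigma: float, tau: float) -> float:
    """Total variation between N(0, tau^2/(sigma^2+tau^2) I_k) and N(0, I_k)."""
    x = sigma**2 / tau**2
    alpha_sq = math.log1p(x) / x      # (tau^2/sigma^2) * ln(1 + sigma^2/tau^2)
    lower = k * alpha_sq
    upper = (1.0 + x) * lower
    return chi2_cdf_even(upper, k) - chi2_cdf_even(lower, k)

# The distance shrinks as the prior gets flatter relative to the noise level.
for tau in (1.0, 10.0, 100.0):
    print(tau, variance_tv(k=4, sigma=1.0, tau=tau))
```

For fixed $k$ the distance vanishes as $\sigma/\tau \to 0$; the condition $k_n = o(\tau_n^4/\sigma_n^4)$ is what keeps it small when the dimension grows.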

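Let us now deal with the centering term, that is, the second term on the
right in (22).

Lemma 4. Let $U$ be a standard normal random variable, let $k \ge 1$ and
let $Z \in \mathbb{R}^k$. Then

\[
\|\mathcal{N}(0, I_k) - \mathcal{N}(Z, I_k)\|_{\mathrm{TV}} = P(|U| \le \|Z\|/2) \le \|Z\|/\sqrt{2\pi}.
\]

Proof. Let $g$ be the density of $\mathcal{N}(0, I_k)$. Then

\[
\begin{aligned}
\|\mathcal{N}(0, I_k) - \mathcal{N}(Z, I_k)\|_{\mathrm{TV}}
&= \int (g(x) - g(x - Z))_+ \, d^k x \\
&= \int (g(x) - g(x - Z))\, \mathbf{1}_{\{2x^T Z \le \|Z\|^2\}} \, d^k x \\
&= P(U \le \|Z\|/2) - P(U + \|Z\| \le \|Z\|/2) \\
&\le \|Z\|/\sqrt{2\pi}.
\end{aligned}
\]

The last line comes from the density of $\mathcal{N}(0, 1)$ being bounded by $1/\sqrt{2\pi}$. □

Using again the invariance of the total variation norm under the bijective
affine map $\theta \mapsto \frac{1}{\sigma_n} \sqrt{\Phi^T\Phi} \left( \theta - \frac{\tau_n^2}{\sigma_n^2 + \tau_n^2} \theta_Y \right)$,

\[
\left\| \mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) - Q \right\|_{\mathrm{TV}}
= \left\| \mathcal{N}\left( \frac{\sigma_n}{\sigma_n^2 + \tau_n^2} \sqrt{\Phi^T\Phi}\, \theta_Y, I_{k_n} \right) - \mathcal{N}(0, I_{k_n}) \right\|_{\mathrm{TV}}
\le \frac{1}{\sqrt{2\pi}} \frac{\sigma_n}{\sigma_n^2 + \tau_n^2} \|\Phi\theta_Y\|.
\]

Moreover, $\|\Phi\theta_Y\| = \|\Sigma Y\| \le \|F_0\| + \sqrt{\varepsilon^T \Sigma \varepsilon}$, and
$\varepsilon^T \Sigma \varepsilon$ is a random variable following the $\sigma_n^2 \chi^2(k_n)$ distribution. By Jensen's
inequality, $E[\sqrt{\varepsilon^T \Sigma \varepsilon}] \le \sqrt{E[\varepsilon^T \Sigma \varepsilon]} = \sigma_n \sqrt{k_n}$. Therefore,

\[
E\left\| \mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) - Q \right\|_{\mathrm{TV}} \le \frac{1}{\sqrt{2\pi}} \frac{\sigma_n}{\sigma_n^2 + \tau_n^2} \left( \|F_0\| + \sigma_n \sqrt{k_n} \right),
\]

which goes to zero under the assumptions of Theorem 1.

To conclude the proof, note that we deduce the results on $W(dF \mid Y)$ from
the ones on $W(d\theta \mid Y)$, by the linear relation $F = \Phi\theta$.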
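
Lemma 4 can be checked numerically in dimension one, where the integral of $(g(x) - g(x - Z))_+$ is easy to approximate. The sketch below compares a midpoint-rule quadrature with the closed form $P(|U| \le \|Z\|/2)$, written with the error function, and with the upper bound $\|Z\|/\sqrt{2\pi}$. The function names, the integration window and the step count are assumptions made for this illustration.

```python
import math

def tv_closed_form(z: float) -> float:
    """P(|U| <= z/2) for a standard normal U, written with the error function."""
    return math.erf(z / (2.0 * math.sqrt(2.0)))

def tv_by_quadrature(z: float, lo: float = -12.0, hi: float = 12.0,
                     steps: int = 20_000) -> float:
    """Midpoint-rule approximation of the integral of (g(x) - g(x - z))_+ on [lo, hi]."""
    def g(x: float) -> float:
        return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += max(g(x) - g(x - z), 0.0)
    return total * width

# Columns: shift z, closed form, quadrature, upper bound z / sqrt(2 pi).
for z in (0.1, 1.0, 3.0):
    print(z, tv_closed_form(z), tv_by_quadrature(z), z / math.sqrt(2.0 * math.pi))
```

For small shifts the bound is nearly attained, which is why the centering term is controlled at the order of $\sigma_n \|\Phi\theta_Y\| / (\sigma_n^2 + \tau_n^2)$.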

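6.2. Proof of Theorem 2. We make the proof for $W(d\theta \mid Y)$. Then the
result for $W(dF \mid Y)$ is immediate. Our method is adapted from Boucheron
and Gassiat (2009).

To any probability measure $P$ on $\mathbb{R}^{k_n}$, we associate the probability

\[
P^M = \frac{P(\,\cdot \cap \mathcal{E}_{\theta_0,\Phi}(M))}{P(\mathcal{E}_{\theta_0,\Phi}(M))} \tag{23}
\]

with support in $\mathcal{E}_{\theta_0,\Phi}(M)$. It can be easily checked that

\[
\|P - P^M\|_{\mathrm{TV}} = P(\mathcal{E}^c_{\theta_0,\Phi}(M)). \tag{24}
\]

The proof is divided into three steps based on the use of $M_n$ as
a threshold to truncate the probability distributions. Lemma 5 below
controls $E\|\mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) - \mathcal{N}^{M_n}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1})\|_{\mathrm{TV}}$, Lemma 6
controls $E\|W^{M_n}(d\theta \mid Y) - \mathcal{N}^{M_n}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1})\|_{\mathrm{TV}}$ and Proposition 9 controls
$E\|W(d\theta \mid Y) - W^{M_n}(d\theta \mid Y)\|_{\mathrm{TV}}$. Taken together, these results give Theorem 2.

Lemma 5. If $k_n \le \frac{1}{4} M_n$, then

\[
E\left\| \mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) - \mathcal{N}^{M_n}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) \right\|_{\mathrm{TV}} \le 2 e^{-(\sqrt{M_n} - 2\sqrt{k_n})^2/8}.
\]

If $k_n = o(M_n)$, for $n$ large enough, this bound can be replaced by $e^{-M_n/9}$.

Proof of Lemma 5. To control this quantity, we consider two cases,
depending on whether $\theta_Y$ is near or far from $\theta_0$:

\[
\left\| \mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) - \mathcal{N}^{M_n}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) \right\|_{\mathrm{TV}}
\le \mathbf{1}_{\{\theta_Y \notin \mathcal{E}_{\theta_0,\Phi}(M_n/4)\}} + \mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1})\left( \mathcal{E}^c_{\theta_Y,\Phi}(M_n/4) \right). \tag{25}
\]

Let $U$ be a random variable following a $\chi^2(k_n)$ distribution. Taking the
expectation on both sides of (25) gives

\[
E\left\| \mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) - \mathcal{N}^{M_n}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) \right\|_{\mathrm{TV}} \le 2 P(U > M_n/4).
\]

Now, Cirelson's inequality [see, e.g., Massart (2007)]

\[
P(\sqrt{U} \ge \sqrt{k_n} + \sqrt{2x}) \le \exp(-x) \tag{26}
\]

used with $x = (\sqrt{M_n} - 2\sqrt{k_n})^2/8$ implies Lemma 5. □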
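
The tail bound behind Lemma 5 can be compared with the exact chi-square tail when the number of degrees of freedom is even, since the tail then has an elementary closed form. The sketch below is an illustration under that simplifying assumption; the function names and the pairs $(k_n, M_n)$ are hypothetical. It checks the tail inequality $P(U \ge M_n/4) \le e^{-x}$ with $x = (\sqrt{M_n} - 2\sqrt{k_n})^2/8$; the bound stated in Lemma 5 is twice this quantity.

```python
import math

def chi2_tail_even(u: float, k: int) -> float:
    """P(U >= u) for U ~ chi-square with an even number k of degrees of freedom."""
    if k % 2 != 0:
        raise ValueError("closed form used here needs an even k")
    half = u / 2.0
    term, total = 1.0, 0.0
    for j in range(k // 2):
        total += term            # adds (u/2)^j / j!
        term *= half / (j + 1)
    return math.exp(-half) * total

def lemma5_exponent(k: int, m: float) -> float:
    """x = (sqrt(M_n) - 2 sqrt(k_n))^2 / 8, meaningful when k_n <= M_n / 4."""
    if 4 * k > m:
        raise ValueError("requires k <= M / 4")
    return (math.sqrt(m) - 2.0 * math.sqrt(k)) ** 2 / 8.0

# Columns: k_n, M_n, exact tail P(U >= M_n/4), exponential bound exp(-x).
for k, m in ((4, 100.0), (10, 400.0), (20, 2000.0)):
    tail = chi2_tail_even(m / 4.0, k)
    bound = math.exp(-lemma5_exponent(k, m))
    print(k, m, tail, bound)
```

The exponential bound is far from tight in these examples, but it is explicit and decays like $e^{-M_n/8}$ up to lower-order terms when $k_n = o(M_n)$, which is all the proof needs.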

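Lemma 6. If $\sup_{\|\Phi h\|^2 \le \sigma_n^2 M_n,\ \|\Phi g\|^2 \le \sigma_n^2 M_n} \frac{w(\theta_0 + h)}{w(\theta_0 + g)} \to 1$ as $n \to \infty$, then

\[
E\left\| W^{M_n}(d\theta \mid Y) - \mathcal{N}^{M_n}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1}) \right\|_{\mathrm{TV}} \to 0 \quad \text{as } n \to \infty.
\]

Proof. Let us first note that, for every $\theta$ and $\tau$ in $\mathbb{R}^{k_n}$, for every $Y \in \mathbb{R}^n$,

\[
\frac{dP_\theta(Y)}{dP_\tau(Y)} = \exp\left\{ \frac{-\|\Phi\theta\|^2 + \|\Phi\tau\|^2 - 2Y^T \Phi(\tau - \theta)}{2\sigma_n^2} \right\}
= \frac{d\mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1})(\theta)}{d\mathcal{N}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1})(\tau)}. \tag{27}
\]

This directly comes from the expressions for the Gaussian densities.

In the following the first lines are just rewriting. Then we use Jensen's
inequality with the convex function $x \mapsto (1 - x)_+$, and make use of (27). We
abbreviate $\mathcal{N}^{M_n}(\theta_Y, \sigma_n^2 (\Phi^T\Phi)^{-1})$ into $\mathcal{N}^{M_n}$:

\[
\begin{aligned}
\left\| W^{M_n}(d\theta \mid Y) - \mathcal{N}^{M_n} \right\|_{\mathrm{TV}}
&= \int \left( 1 - \frac{d\mathcal{N}^{M_n}(\theta)}{dW^{M_n}(\theta \mid Y)} \right)_+ dW^{M_n}(\theta \mid Y) \\
&= \int \left( 1 - \int \frac{w(\tau)\, dP_\tau(Y)\, d\mathcal{N}^{M_n}(\theta)}{w(\theta)\, dP_\theta(Y)\, d\mathcal{N}^{M_n}(\tau)}\, d\mathcal{N}^{M_n}(\tau) \right)_+ dW^{M_n}(\theta \mid Y) \\
&\le \iint \left( 1 - \frac{w(\tau)\, dP_\tau(Y)\, d\mathcal{N}^{M_n}(\theta)}{w(\theta)\, dP_\theta(Y)\, d\mathcal{N}^{M_n}(\tau)} \right)_+ d\mathcal{N}^{M_n}(\tau)\, dW^{M_n}(\theta \mid Y) \\
&= \iint \left( 1 - \frac{w(\tau)}{w(\theta)} \right)_+ d\mathcal{N}^{M_n}(\tau)\, dW^{M_n}(\theta \mid Y) \\
&\le 1 - \inf_{\|\Phi h\|^2 \le \sigma_n^2 M_n,\ \|\Phi g\|^2 \le \sigma_n^2 M_n} \frac{w(\theta_0 + h)}{w(\theta_0 + g)}.
\end{aligned}
\]

□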

Proposition 9 (Posterior concentration). Suppose that conditions (1),
(2) and (3) of Theorem 2 hold. Then

\[
E\left\| W(d\theta \mid Y) - W^{M_n}(d\theta \mid Y) \right\|_{\mathrm{TV}} = E\left[ W(\mathcal{E}^c_{\theta_0,\Phi}(M_n) \mid Y) \right] \to 0 \quad \text{as } n \to \infty.
\]

Proposition 9 is proved in Appendix A in the supplemental article
[Bontemps (2011)]. However, we state here the following important lemma,
because of its significance.

Lemma 7. Let $a \in \mathbb{R}^n$ such that $\Phi^T a = 0$. Then, for any $y \in \mathbb{R}^n$,
$W(dF \mid Y = y) = W(dF \mid Y = y + a)$.

Lemma 7 states that the distribution $W(dF \mid Y)$ is invariant under any
translation of $Y$ orthogonal to $\langle\phi\rangle$. Now, regard $W(dF \mid Y)$ as a random
variable. Then any statement on $W(dF \mid Y)$ or $W(d\theta \mid Y)$ valid when
$Y \sim \mathcal{N}(F_0, \sigma_n^2 I_n)$ with $F_0 \in \langle\phi\rangle$ can be extended at zero cost by Lemma 7 to the
case $F_0 \in \mathbb{R}^n$. For instance, proving Proposition 9 in the case $F_0 = \Phi\theta_0$ is
enough.

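6.3. Proof of Theorem 3. We begin with (13). Consider the following
Taylor expansion:

\[
\begin{aligned}
G(F) - G(Y_{\langle\phi\rangle})
&= \dot G_{F_{\langle\phi\rangle}}(F - Y_{\langle\phi\rangle}) \\
&\quad + \int_0^1 (1 - t)\, D^2_{F_{\langle\phi\rangle} + t(F - F_{\langle\phi\rangle})} G(F - F_{\langle\phi\rangle}, F - F_{\langle\phi\rangle})\, dt \\
&\quad - \int_0^1 (1 - t)\, D^2_{F_{\langle\phi\rangle} + t(Y_{\langle\phi\rangle} - F_{\langle\phi\rangle})} G(Y_{\langle\phi\rangle} - F_{\langle\phi\rangle}, Y_{\langle\phi\rangle} - F_{\langle\phi\rangle})\, dt
\end{aligned}
\]

using the Lagrange form of the error term. Suppose that $F \in \langle\phi\rangle$,
$\|F - F_{\langle\phi\rangle}\|^2 \le \sigma_n^2 M_n$ and $\|Y_{\langle\phi\rangle} - F_{\langle\phi\rangle}\|^2 \le \sigma_n^2 M_n$. Then, for any $b \in \mathbb{R}^p$,

\[
\left| b^T \left( G(F) - G(Y_{\langle\phi\rangle}) - \dot G_{F_{\langle\phi\rangle}}(F - Y_{\langle\phi\rangle}) \right) \right| \le \|b\|\, B_{F_{\langle\phi\rangle}}(M_n).
\]

On the other hand, $\sqrt{b^T \Gamma_{F_{\langle\phi\rangle}} b} \ge \|\Gamma_{F_{\langle\phi\rangle}}^{-1}\|^{-1/2} \|b\|$. Moreover,

\[
\left\| W\left( d\left( \frac{b^T \dot G_{F_{\langle\phi\rangle}}(F - Y_{\langle\phi\rangle})}{\sqrt{b^T \Gamma_{F_{\langle\phi\rangle}} b}} \right) \Big|\, Y \right) - \mathcal{N}(0, 1) \right\|_{\mathrm{TV}} \le \left\| W(dF \mid Y) - \mathcal{N}(Y_{\langle\phi\rangle}, \sigma_n^2 \Sigma) \right\|_{\mathrm{TV}}.
\]

Let $\eta_n = \|\Gamma_{F_{\langle\phi\rangle}}^{-1}\|^{1/2} B_{F_{\langle\phi\rangle}}(M_n)$, which tends to 0 by hypothesis. Let also

\[
I^{\eta_n} = \{ x \in \mathbb{R} : \exists x' \in I,\ |x - x'| \le \eta_n \}.
\]

Note that $\psi(I^{\eta_n}) \le \psi(I) + \sqrt{2/\pi}\, \eta_n$.

Gathering all this information, we can get the upper bound

\[
\begin{aligned}
W\left( \frac{b^T (G(F) - G(Y_{\langle\phi\rangle}))}{\sqrt{b^T \Gamma_{F_{\langle\phi\rangle}} b}} \in I \,\Big|\, Y \right)
&\le W\left( \frac{b^T \dot G_{F_{\langle\phi\rangle}}(F - Y_{\langle\phi\rangle})}{\sqrt{b^T \Gamma_{F_{\langle\phi\rangle}} b}} \in I^{\eta_n} \,\Big|\, Y \right) \\
&\quad + \mathbf{1}_{\{\|Y_{\langle\phi\rangle} - F_{\langle\phi\rangle}\|^2 > \sigma_n^2 M_n\}} + W\left( \|F - F_{\langle\phi\rangle}\|^2 > \sigma_n^2 M_n \mid Y \right) \\
&\le \psi(I) + \sqrt{2/\pi}\, \eta_n + \left\| W(dF \mid Y) - \mathcal{N}(Y_{\langle\phi\rangle}, \sigma_n^2 \Sigma) \right\|_{\mathrm{TV}} \\
&\quad + \mathbf{1}_{\{\|Y_{\langle\phi\rangle} - F_{\langle\phi\rangle}\|^2 > \sigma_n^2 M_n\}} + W\left( \|F - F_{\langle\phi\rangle}\|^2 > \sigma_n^2 M_n \mid Y \right).
\end{aligned}
\]

A lower bound is obtained in the same way. Taking the expectation,

\[
\begin{aligned}
E\left[ \sup_{I \in \mathcal{I}} \left| W\left( \frac{b^T (G(F) - G(Y_{\langle\phi\rangle}))}{\sqrt{b^T \Gamma_{F_{\langle\phi\rangle}} b}} \in I \,\Big|\, Y \right) - \psi(I) \right| \right]
&\le o(1) + P\left( \|Y_{\langle\phi\rangle} - F_{\langle\phi\rangle}\|^2 > \sigma_n^2 M_n \right) \\
&\quad + E\left[ W\left( \|F - F_{\langle\phi\rangle}\|^2 > \sigma_n^2 M_n \mid Y \right) \right].
\end{aligned} \tag{28}
\]

But $\|Y_{\langle\phi\rangle} - F_{\langle\phi\rangle}\|^2$ follows the $\sigma_n^2 \chi^2(k_n)$ distribution, and since $k_n = o(M_n)$,

\[
P\left( \|Y_{\langle\phi\rangle} - F_{\langle\phi\rangle}\|^2 > \sigma_n^2 M_n \right) = o(1).
\]

To bound (28), we use the following: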

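Lemma 8. Suppose that the conditions of either Theorems 1 or 2 are
satisfied. Then

\[
E\left[ W\left( \|F - F_{\langle\phi\rangle}\|^2 > \sigma_n^2 M_n \mid Y \right) \right] \to 0 \quad \text{as } n \to \infty.
\]

Proof. For smooth priors, this is an immediate corollary of Proposition 9.
Let us suppose we are under the conditions of Theorem 1.

Let $Z$ be a $\mathcal{N}\left(0, \frac{\sigma_n^2 \tau_n^2}{\sigma_n^2 + \tau_n^2} \Sigma\right)$ random vector in $\mathbb{R}^n$ independent of $Y$, and $U$
a random variable following $\chi^2(k_n)$. From (21) we get

\[
W(dF \mid Y) = \mathcal{N}\left( \frac{\tau_n^2}{\sigma_n^2 + \tau_n^2} Y_{\langle\phi\rangle}, \; \frac{\sigma_n^2 \tau_n^2}{\sigma_n^2 + \tau_n^2} \Sigma \right).
\]

Therefore,

\[
\begin{aligned}
W\left( \|F - F_{\langle\phi\rangle}\|^2 > \sigma_n^2 M_n \mid Y \right)
&= P\left( \left\| Z + \frac{\tau_n^2}{\sigma_n^2 + \tau_n^2} Y_{\langle\phi\rangle} - F_{\langle\phi\rangle} \right\|^2 > \sigma_n^2 M_n \,\Big|\, Y \right) \\
&\le
\begin{cases}
1, & \text{if } \left\| \frac{\tau_n^2}{\sigma_n^2 + \tau_n^2} Y_{\langle\phi\rangle} - F_{\langle\phi\rangle} \right\| > \frac{2\sigma_n \sqrt{M_n}}{3}, \\[1ex]
P\left( \|Z\|^2 > \frac{\sigma_n^2 M_n}{9} \right), & \text{otherwise.}
\end{cases}
\end{aligned}
\]

Since $k_n = o(M_n)$, $P(U > M_n/9) = o(1)$. On the other hand,

\[
\left\| \frac{\tau_n^2}{\sigma_n^2 + \tau_n^2} Y_{\langle\phi\rangle} - F_{\langle\phi\rangle} \right\| \le \|\Sigma\varepsilon\| + \frac{\sigma_n^2}{\sigma_n^2 + \tau_n^2} \|F_0\|.
\]

Since $\|F_0\| = o(\tau_n^2/\sigma_n)$, $\frac{\sigma_n}{\sigma_n^2 + \tau_n^2} \|F_0\| = o(1) \le \frac{\sqrt{M_n}}{3}$ for $n$ large enough. $\|\Sigma\varepsilon\|^2$ is
a $\sigma_n^2 \chi^2(k_n)$ variable. Therefore, for $n$ large enough,

\[
E\left[ W\left( \|F - F_{\langle\phi\rangle}\|^2 > \sigma_n^2 M_n \mid Y \right) \right] \le 2 P(U > M_n/9) = o(1). \qquad \Box
\]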



Now, (28) gives (13). 

The proof of the frequentist assertion (14) is similar and deferred to
Appendix C in the supplemental article [Bontemps (2011)].

Acknowledgments. The author would like to thank E. Gassiat and I. Cas- 
tillo for valuable discussions and suggestions. 



SUPPLEMENTARY MATERIAL 

Supplement to "Bernstein-von Mises theorems for Gaussian regression
with increasing number of regressors" (DOI: 10.1214/11-AOS912SUPP;
.pdf). This contains the proofs of various technical results stated in the
main article "Bernstein-von Mises theorems for Gaussian regression with
increasing number of regressors."